HFEPX Metric Hub

Accuracy & Pass Rate Metric Papers In CS.LG

Updated from current HFEPX corpus (Mar 1, 2026). 19 papers are grouped in this metric page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: Ad-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 15, 2026.

Papers: 19 · Last published: Feb 15, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Medium.

Metric Coverage

94.7%

18 sampled papers include metric names.

Benchmark Anchoring

21.1%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

0.0%

0 papers report calibration/adjudication/IAA controls.

None of the 19 papers in this sample is flagged as low-signal.
Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Why This Matters (Expanded)

Why This Matters For Eval Research

22.2% of papers report explicit human-feedback signals, led by expert verification.
Automatic metrics appear in 89.5% of papers in this hub.
Ad-Bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
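
One common way to quantify the judge-human agreement mentioned in the last takeaway is Cohen's kappa, computed over items that both a human rater and an LLM judge have labeled. The sketch below is illustrative only: the function name and the sample labels are hypothetical and are not drawn from any paper in this hub.

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: share of items where both raters give the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels, for illustration only.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)
```

Tracking this value per paper or per benchmark slice, rather than raw percent agreement, corrects for agreement that would occur by chance when one label dominates.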

Metric Interpretation

accuracy is reported in 83.3% of sampled hub papers (15/18); compare with a secondary metric before ranking methods.
cost is reported in 16.7% of sampled hub papers (3/18); compare with a secondary metric before ranking methods.
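
The shares in this section depend on which denominator is used: the 18 sampled papers that include metric names, or all 19 papers in the hub. A minimal sketch of the arithmetic, using only counts that appear on this page (the helper name `share` is hypothetical):

```python
def share(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)

# Accuracy is named by 15 papers, cost by 3.
accuracy_vs_named = share(15, 18)  # denominator: papers that include metric names
accuracy_vs_all = share(15, 19)    # denominator: all papers in the hub
cost_vs_named = share(3, 18)
```

Stating the denominator next to each percentage keeps figures comparable across hub pages.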

Benchmark Context

Ad-Bench appears in 5.6% of sampled hub papers (1/18); use this cohort for benchmark-matched comparisons.
Ama-Bench appears in 5.6% of sampled hub papers (1/18); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Feb 15, 2026 · Citations: 0 · Score: 9.0
Metrics: Pass@1, Pass@3 · Eval: Simulation Env

Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Feb 25, 2026 · Citations: 0 · Score: 9.0
Metrics: Accuracy · Eval: Automatic Metrics

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Feb 26, 2026 · Citations: 0 · Score: 8.0
Metrics: Accuracy · Eval: Automatic Metrics

SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Feb 25, 2026 · Citations: 0 · Score: 8.0
Metrics: Pass@1, Latency · Eval: Automatic Metrics

Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Feb 20, 2026 · Citations: 0 · Score: 7.5
Metrics: Accuracy, Win rate · Eval: Llm As Judge, Automatic Metrics

APEX-Agents
Jan 20, 2026 · Citations: 0 · Score: 7.0
Metrics: Pass@1 · Eval: Automatic Metrics

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper	Published	Metrics	Benchmarks	Eval Modes	Quality Controls
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents	Feb 15, 2026	Pass@1, Pass@3	Ad Bench	Simulation Env	Not reported
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences	Feb 25, 2026	Accuracy	LiveCodeBench, Mathbench	Automatic Metrics	Not reported
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications	Feb 26, 2026	Accuracy	Ama Bench	Automatic Metrics	Not reported
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents	Feb 25, 2026	Pass@1, Latency	SWE Bench, SWE Bench Verified	Automatic Metrics	Not reported
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards	Feb 20, 2026	Accuracy, Win rate	Not reported	Llm As Judge, Automatic Metrics	Not reported
APEX-Agents	Jan 20, 2026	Pass@1	Not reported	Automatic Metrics	Not reported
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing	Oct 14, 2025	Accuracy	Not reported	Automatic Metrics	Not reported
GATES: Self-Distillation under Privileged Context with Consensus Gating	Feb 24, 2026	Accuracy	Not reported	Automatic Metrics	Not reported
Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training	Feb 26, 2026	Accuracy	Not reported	Automatic Metrics	Not reported
Distill and Align Decomposition for Enhanced Claim Verification	Feb 25, 2026	Accuracy, F1	Not reported	Human Eval, Automatic Metrics	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (22.2% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (22.2% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (100% vs 35% target).
Moderate: Papers with known rater population

Coverage is usable but incomplete (27.8% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (27.8% vs 35% target).
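
The checklist labels each item by comparing coverage against a target. The page does not state the rule it uses, so the sketch below is an assumption: it treats coverage at or above target as Strong, coverage at or above 60% of target as Moderate, and anything lower as Gap. The 0.6 cutoff and the function name are hypothetical, chosen only because they reproduce the six labels shown above.

```python
def coverage_band(coverage_pct, target_pct, gap_ratio=0.6):
    """Label coverage relative to a target.

    gap_ratio is a hypothetical cutoff, not a documented HFEPX rule.
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "Strong"
    if ratio >= gap_ratio:
        return "Moderate"
    return "Gap"

# (coverage %, target %) pairs copied from the checklist.
checklist = {
    "explicit human feedback": (22.2, 45),
    "quality controls": (0.0, 30),
    "benchmarks/datasets named": (22.2, 35),
    "evaluation metrics named": (100.0, 35),
    "known rater population": (27.8, 35),
    "known annotation unit": (27.8, 35),
}
bands = {name: coverage_band(cov, target) for name, (cov, target) in checklist.items()}
```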

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
Agentic evaluation appears in 61.1% of papers.

Known Gaps

No papers report quality controls (0%); prioritize calibration/adjudication evidence.
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (Ad-Bench vs Ama-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
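
Stratifying by benchmark before comparing methods can be sketched as a simple grouping step. The rows below are copied from the protocol matrix on this page, with shortened paper names; the function name and the "more than one paper" filter are illustrative assumptions.

```python
from collections import defaultdict

# (short paper name, benchmarks) rows from the protocol matrix.
matrix = [
    ("AD-Bench", ["Ad Bench"]),
    ("Duel-Evolve", ["LiveCodeBench", "Mathbench"]),
    ("AMA-Bench", ["Ama Bench"]),
    ("SWE-Protégé", ["SWE Bench", "SWE Bench Verified"]),
    ("Gradient Regularization", []),  # benchmark not reported
]

def group_by_benchmark(rows):
    """Map each benchmark name to the papers that report on it."""
    groups = defaultdict(list)
    for paper, benchmarks in rows:
        # Papers without a named benchmark go into their own bucket.
        for name in benchmarks or ["Not reported"]:
            groups[name].append(paper)
    return dict(groups)

cohorts = group_by_benchmark(matrix)
# Only benchmarks shared by two or more papers support matched comparisons.
comparable = {b: p for b, p in cohorts.items() if b != "Not reported" and len(p) > 1}
```

With these rows, `comparable` comes out empty because no benchmark is shared by two papers, which is consistent with the thin benchmark anchoring reported for this hub.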

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: Ad-Bench
Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

No papers report quality controls (0%); prioritize calibration/adjudication evidence.
LLM-as-judge appears without enough inter-annotator agreement reporting.
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Top Metrics

Accuracy (15)
Cost (3)
Pass@1 (3)
Latency (2)

Evaluation Modes

Automatic Metrics (17)
Simulation Env (3)
Llm As Judge (2)
Human Eval (1)

Top Benchmarks

Ad Bench (1)
Ama Bench (1)
LiveCodeBench (1)
Mathbench (1)

Agentic Mix

Long Horizon (11)
Web Browsing (1)

Top Papers Reporting This Metric

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0

Simulation Env Coding

While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0

Automatic Metrics Law

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0

Automatic Metrics Math

Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu · Feb 26, 2026 · Citations: 0

Automatic Metrics General

To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications.
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0

Automatic Metrics Coding

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu · Oct 14, 2025 · Citations: 0

Automatic Metrics Coding

Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference.
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026 · Citations: 0

Llm As Judge Automatic Metrics Math

Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026 · Citations: 0

Automatic Metrics Simulation Env Coding

Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation.
GATES: Self-Distillation under Privileged Context with Consensus Gating
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026 · Citations: 0

Automatic Metrics Math

Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025 · Citations: 0

Automatic Metrics Coding

Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026 · Citations: 0

Automatic Metrics General

We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through…
Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026 · Citations: 0

Human Eval Automatic Metrics General

Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84).
Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026 · Citations: 0

Llm As Judge Automatic Metrics General

We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g.
Synthesis of discrete-continuous quantum circuits with multimodal diffusion models
Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil · Jun 2, 2025 · Citations: 0

Automatic Metrics Simulation Env General

We benchmark the model over different experiments, analyzing the method's accuracy across varying qubit counts and circuit depths, showcasing the ability of the method to outperform existing approaches in gate counts and under noisy conditi
How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026 · Citations: 0

Automatic Metrics General

First, we observe pervasive shortcut behavior, where they achieve high accuracy without relying on latent reasoning.
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0

Automatic Metrics Coding

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
Think like a Scientist: Physics-guided LLM Agent for Equation Discovery
Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu · Feb 12, 2026 · Citations: 0

Automatic Metrics General

We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process.
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025 · Citations: 0

Automatic Metrics Law

We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Rafael R. Baptista, André de Lima Salgado, Ricardo V. Godoy, Marcelo Becker, Thiago Boaventura · Feb 26, 2026 · Citations: 0
