
HFEPX Metric Hub

Relevance In CS.LG Papers

Updated from the current HFEPX corpus (Apr 9, 2026). This metric page groups 13 papers. Common evaluation modes: Automatic Metrics, Human Eval. Common annotation unit: Pairwise. Frequently cited benchmark: RewardBench. Common metric signal: relevance. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Apr 8, 2026.

Papers: 13 | Last published: Apr 8, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Developing.

Metric Coverage

30.8%

4 of 13 sampled papers name their evaluation metrics explicitly.

Benchmark Anchoring

7.7%

1 of 13 papers names an explicit dataset/benchmark anchor for fair comparison.

Quality Controls

0.0%

None of the 13 papers reports calibration, adjudication, or inter-annotator agreement (IAA) controls.

  • None of the 13 papers in this sample is flagged as low-signal.
  • Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.
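
The three coverage cards above reduce to simple ratio arithmetic. A minimal sketch, assuming a hypothetical per-paper metadata schema (the `metrics`, `benchmarks`, and `quality_controls` field names are illustrative, not the actual HFEPX export format):

```python
# Hypothetical records; only 2 of the 13 hub papers are shown.
papers = [
    {"title": "Personalized RewardBench", "metrics": ["Accuracy", "Helpfulness"],
     "benchmarks": ["RewardBench"], "quality_controls": []},
    {"title": "MemRerank", "metrics": ["Accuracy", "Relevance"],
     "benchmarks": [], "quality_controls": []},
    # ... the remaining 11 hub papers ...
]

def coverage(papers, field):
    """Percentage of papers whose `field` list is non-empty."""
    return 100.0 * sum(1 for p in papers if p.get(field)) / len(papers)

# On the full 13-paper set this reproduces the cards above:
# metrics -> 30.8% (4/13), benchmarks -> 7.7% (1/13), quality_controls -> 0.0%.
for field in ("metrics", "benchmarks", "quality_controls"):
    print(f"{field}: {coverage(papers, field):.1f}%")
```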

Why This Matters (Expanded)

Why This Matters For Eval Research

  • 15.4% of papers report explicit human-feedback signals, led by pairwise preferences (a minimal aggregation sketch follows this list).
  • Automatic metrics appear in 23.1% of papers in this hub.
  • RewardBench is a recurring benchmark anchor for cross-paper comparisons on this page.
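
Because pairwise preferences lead the human-feedback signals here, a count-based aggregation sketch may help when scoping replication: it turns (winner, loser) judgments into per-system win rates. This is a generic illustration with placeholder system names, not the aggregation method of any hub paper:

```python
from collections import defaultdict

# Hypothetical pairwise judgments: (preferred system, rejected system).
judgments = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

wins, comparisons = defaultdict(int), defaultdict(int)
for winner, loser in judgments:
    wins[winner] += 1
    comparisons[winner] += 1
    comparisons[loser] += 1

# Rank systems by win rate (wins / comparisons they appeared in).
for system in sorted(comparisons, key=lambda s: -wins[s] / comparisons[s]):
    print(f"{system}: win rate {wins[system] / comparisons[system]:.2f}")
```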

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps (see the agreement sketch after this list).
  • Rater pools are mostly unspecified and annotation is commonly pairwise; use this to scope replication staffing.
  • Pair this hub with the llm_as_judge pages to benchmark automated-versus-human evaluation tradeoffs.
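
The simplest of the missing quality controls is inter-annotator agreement. A minimal sketch of Cohen's kappa for two raters on categorical (e.g., pairwise preference) labels; the rater data is invented for illustration:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Hypothetical pairwise preference labels from two raters.
rater_1 = ["A", "B", "A", "A", "B", "A"]
rater_2 = ["A", "B", "B", "A", "B", "A"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")  # 0.67
```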

Metric Interpretation

  • Relevance is reported as a metric signal in 100% of hub papers (13/13); compare it against a secondary metric before ranking methods (a rank-correlation sketch follows this list).
  • Accuracy is reported in 30.8% of hub papers (4/13); the same caution applies.
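
To make the secondary-metric advice concrete, a minimal sketch of Spearman rank correlation between relevance-based and accuracy-based method rankings. The scores are invented, and the tie-free ranking is a simplification:

```python
def spearman(xs, ys):
    """Spearman's rho for two score lists (assumes no tied scores)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i], reverse=True)
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

relevance = [0.82, 0.75, 0.91, 0.60]  # hypothetical per-method scores
accuracy  = [0.70, 0.78, 0.88, 0.55]
print(f"Spearman rho = {spearman(relevance, accuracy):.2f}")  # 0.80
# A low rho means the two metrics rank methods differently, so a
# single-metric ranking would be unstable.
```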

Benchmark Context

  • RewardBench appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side; a cohort-grouping sketch follows the table.

Paper | Date | Metrics | Benchmarks | Eval Modes | Quality Controls
--- | --- | --- | --- | --- | ---
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization | Apr 8, 2026 | Accuracy, Helpfulness | RewardBench | Human Eval, Automatic Metrics | Not reported
MemRerank: Preference Memory for Personalized Product Reranking | Mar 31, 2026 | Accuracy, Relevance | Not reported | Automatic Metrics | Not reported
Multi-Agent Environments for Vehicle Routing Problems | Nov 21, 2024 | Relevance | Not reported | Simulation Env | Not reported
CodeRefine: A Pipeline for Enhancing LLM-Generated Code Implementations of Research Papers | Aug 23, 2024 | Relevance | Not reported | Automatic Metrics | Not reported
Forgetting to Witness: Efficient Federated Unlearning and Its Visible Evaluation | Apr 6, 2026 | Not reported | Not reported | Not reported | Not reported
Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them | Apr 6, 2026 | Not reported | Not reported | Not reported | Not reported
Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia | Apr 2, 2026 | Not reported | Not reported | Not reported | Not reported
Decidable By Construction: Design-Time Verification for Trustworthy AI | Mar 26, 2026 | Not reported | Not reported | Not reported | Not reported
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA | Mar 25, 2026 | Not reported | Not reported | Not reported | Not reported
L2GTX: From Local to Global Time Series Explanations | Mar 13, 2026 | Not reported | Not reported | Not reported | Not reported
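
A minimal sketch of the cohort logic this matrix supports: group papers by (benchmark, eval mode) so that metric comparisons stay within compatible setups. The records below are abbreviated from the rows above:

```python
from collections import defaultdict

# (paper, benchmark or None, eval mode), abbreviated from the matrix.
rows = [
    ("Personalized RewardBench", "RewardBench", "Human Eval"),
    ("Personalized RewardBench", "RewardBench", "Automatic Metrics"),
    ("MemRerank", None, "Automatic Metrics"),
    ("CodeRefine", None, "Automatic Metrics"),
    ("Multi-Agent Environments for Vehicle Routing Problems", None, "Simulation Env"),
]

cohorts = defaultdict(list)
for paper, benchmark, mode in rows:
    cohorts[(benchmark, mode)].append(paper)

# Only papers sharing a cohort key are safe to compare on a metric.
for (benchmark, mode), members in cohorts.items():
    print(f"{benchmark or 'no benchmark'} / {mode}: {members}")
```
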
Researcher Workflow (Detailed)

Checklist

Each item compares observed coverage against a target threshold; a minimal sketch of this gap-check logic follows the list.

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (15.4% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (7.7% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (100% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (0% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (15.4% vs 35% target).
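
A minimal sketch of the gap-check logic above, using the observed/target pairs from the checklist (dimension names are shorthand):

```python
# (observed coverage %, target %) per checklist dimension.
TARGETS = {
    "explicit human feedback": (15.4, 45.0),
    "quality controls":        (0.0,  30.0),
    "benchmark anchors":       (7.7,  35.0),
    "metric names":            (100.0, 35.0),
    "rater population known":  (0.0,  35.0),
    "annotation unit known":   (15.4, 35.0),
}

for dimension, (observed, target) in TARGETS.items():
    status = "Strong" if observed >= target else "Gap"
    print(f"{status:>6}: {dimension} ({observed:.1f}% vs {target:.0f}% target)")
```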

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • No paper in this slice (0%) reports quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Annotation unit is under-specified (15.4% coverage).

Suggested Next Analyses

  • Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
  • Track metric sensitivity by reporting both relevance and accuracy.

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Top Metrics

  • Relevance (13)
  • Accuracy (4)
  • Agreement (2)
  • Faithfulness (2)

Evaluation Modes

  • Automatic Metrics (3)
  • Human Eval (1)
  • Simulation Env (1)

Top Benchmarks

  • Rewardbench (1)

Agentic Mix

  • Long Horizon (1)
  • Multi Agent (1)
