
HFEPX Hub

Automatic Metrics + Pairwise Preference (Last 30 Days)

Updated from the current HFEPX corpus (Mar 1, 2026). 18 papers are grouped on this hub page.


Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Inter-Annotator Agreement Reported. Frequently cited benchmark: ChartEditBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 18 · Last published: Feb 26, 2026
Tags: Automatic Metrics · Pairwise Preference · Last 30d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%

18 / 18 sampled papers are not flagged as low-signal.

Replication-Ready Set

3

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

0

Papers containing both `human_eval` and `llm_as_judge`.

  • 3 papers are replication-ready (benchmark + metric + explicit evaluation mode).
  • 0 papers support judge-vs-human agreement analysis.
  • 3 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.
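The replication-ready cut above (benchmark + metric + explicit evaluation mode) is easy to reproduce over abstract-level metadata. A minimal sketch, assuming a hypothetical record schema with `benchmarks`, `metrics`, and `eval_modes` fields (not the hub's actual export format):

```python
# Hypothetical triage filter over abstract-level paper metadata.
# The field names below are assumptions, not the hub's real schema.

def is_replication_ready(paper: dict) -> bool:
    """True when the paper names a benchmark, a metric, and an eval mode."""
    return all(paper.get(key) for key in ("benchmarks", "metrics", "eval_modes"))

papers = [
    {"title": "SCOPE", "benchmarks": ["MT-Bench"], "metrics": ["error rate"],
     "eval_modes": ["automatic_metrics"]},
    {"title": "PrivAct", "benchmarks": [], "metrics": ["helpfulness"],
     "eval_modes": ["automatic_metrics"]},
]

ready = [p["title"] for p in papers if is_replication_ready(p)]
print(ready)  # → ['SCOPE']
```

Only papers with all three anchors survive the filter; this matches how the "Replication-Ready Set" count of 3 is defined above.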


Why This Matters For Eval Research

  • 100% of papers report explicit human-feedback signals, led by pairwise preferences.
  • automatic metrics appears in 100% of papers in this hub.
  • ChartEditBench serves as a benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

  • Most common quality-control signal is inter-annotator agreement reporting (11.1% of papers).
  • Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
  • Stratify by benchmark (ChartEditBench vs. LiveCodeBench) before comparing methods.
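Since inter-annotator agreement reporting is the most common quality-control signal in this hub, a chance-corrected statistic such as Cohen's kappa is the usual way to summarize how two raters' pairwise labels align. A self-contained sketch with purely illustrative labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labeled independently at their base rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two raters judging the same pairwise comparisons ("A" = first response wins).
rater_1 = ["A", "A", "B", "A", "B", "B", "A", "B"]
rater_2 = ["A", "A", "B", "B", "B", "B", "A", "A"]
print(cohens_kappa(rater_1, rater_2))  # → 0.5
```

Kappa of 1.0 is perfect agreement and 0 is chance-level; reporting it alongside raw percent agreement makes replication staffing decisions easier to defend.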

Benchmark Interpretation

  • ChartEditBench appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.
  • LiveCodeBench appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 22.2% of hub papers (4/18); compare with a secondary metric before ranking methods.
  • agreement is reported in 16.7% of hub papers (3/18); compare with a secondary metric before ranking methods.
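One way to act on "compare with a secondary metric before ranking methods" is to rank the same methods under both metrics and flag when the orderings diverge. A sketch with purely illustrative numbers (not taken from any paper in this hub):

```python
def rank_by(scores: dict, metric: str) -> list:
    """Method names ordered best-first under a single metric."""
    return sorted(scores, key=lambda m: scores[m][metric], reverse=True)

# Illustrative per-method results; the names and values are made up.
scores = {
    "method_x": {"accuracy": 0.81, "agreement": 0.64},
    "method_y": {"accuracy": 0.79, "agreement": 0.71},
    "method_z": {"accuracy": 0.74, "agreement": 0.58},
}

primary = rank_by(scores, "accuracy")     # → ['method_x', 'method_y', 'method_z']
secondary = rank_by(scores, "agreement")  # → ['method_y', 'method_x', 'method_z']
if primary != secondary:
    print("ranking is metric-sensitive; report both metrics")
```

When the two orderings disagree, a single-metric leaderboard claim is fragile and both metrics should be reported.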

Researcher Checklist

  • Strong: Papers with explicit human feedback

    Coverage is strong (100% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (16.7% vs 30% target).

  • Moderate: Papers naming benchmarks/datasets

    Coverage is usable but incomplete (22.2% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (66.7% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (11.1% vs 35% target).

  • Strong: Papers with known annotation unit

    Coverage is strong (38.9% vs 35% target).

Strengths

  • Strong human-feedback signal (100% of papers).

Known Gaps

  • Only 16.7% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (11.1% coverage).

Suggested Next Analyses

  • Stratify by benchmark (ChartEditBench vs. LiveCodeBench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and agreement.

Recommended Queries

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).
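The protocol-completeness ranking described here can be approximated with a simple additive score. A sketch, assuming hypothetical metadata field names (not the hub's actual schema):

```python
def completeness_score(paper: dict) -> int:
    """One point per protocol ingredient explicitly present in the metadata."""
    return sum([
        bool(paper.get("human_feedback")),                            # human signal
        bool(paper.get("benchmarks")),                                # benchmark anchor
        bool(paper.get("metrics")),                                   # metric anchor
        bool(paper.get("quality_controls")),                          # QC evidence
        bool(paper.get("human_eval") and paper.get("llm_as_judge")),  # judge/human overlap
    ])

papers = [
    {"title": "paper_a", "human_feedback": True, "benchmarks": ["B1"],
     "metrics": ["accuracy"], "quality_controls": ["calibration"]},
    {"title": "paper_b", "human_feedback": True, "metrics": ["cost"]},
]

ranked = sorted(papers, key=completeness_score, reverse=True)
print([p["title"] for p in ranked])  # → ['paper_a', 'paper_b']
```

An additive score keeps the ranking transparent: each point corresponds to one checkable claim in the abstract, so ties are easy to audit by hand.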

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

  • SCOPE: Selective Conformal Optimized Pairwise LLM Judging (Feb 13, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: MT-Bench, LMSYS Chatbot Arena · Metrics: Error rate · QC: Calibration
  • MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks (Feb 18, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: MemoryArena · Metrics: Recall · QC: Not Reported
  • Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences (Feb 25, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: LiveCodeBench, MathBench · Metrics: Accuracy · QC: Not Reported
  • Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language (Feb 21, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Agreement · QC: Inter-Annotator Agreement Reported, Adjudication
  • Same Words, Different Judgments: Modality Effects on Preference Alignment (Feb 26, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Agreement · QC: Inter-Annotator Agreement Reported
  • ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models (Feb 17, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: ChartEditBench · Metrics: Not Reported · QC: Not Reported
  • Multi-Objective Alignment of Language Models for Personalized Psychotherapy (Feb 17, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Agreement, Cost · QC: Not Reported
  • Modeling Distinct Human Interaction in Web Agents (Feb 19, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Accuracy · QC: Not Reported
  • PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training (Feb 14, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Helpfulness · QC: Not Reported
  • CAMEL: Confidence-Gated Reflection for Reward Modeling (Feb 24, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Accuracy, Cost · QC: Not Reported
  • RLHFless: Serverless Computing for Efficient RLHF (Feb 26, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Cost · QC: Not Reported
  • DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs (Feb 25, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Accuracy · QC: Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

  • SCOPE: Selective Conformal Optimized Pairwise LLM Judging · Human feedback: Pairwise Preference · Eval modes: Automatic Metrics · Benchmarks: MT-Bench, LMSYS Chatbot Arena · Metrics: Error rate · QC: Calibration · Rater population: Unknown · Annotation unit: Pairwise
  • MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks · Human feedback: Pairwise Preference · Eval modes: Automatic Metrics · Benchmarks: MemoryArena · Metrics: Recall · QC: Not reported · Rater population: Unknown · Annotation unit: Unknown
  • Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences · Human feedback: Pairwise Preference · Eval modes: Automatic Metrics · Benchmarks: LiveCodeBench, MathBench · Metrics: Accuracy · QC: Not reported · Rater population: Unknown · Annotation unit: Pairwise

Suggested Reading Order

  1. RLHFless: Serverless Computing for Efficient RLHF

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics + pairwise preferences. Focus: cost. Abstract: Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences.

  2. Same Words, Different Judgments: Modality Effects on Preference Alignment

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics + pairwise preferences. Focus: agreement. Abstract: Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences.

  3. DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics + pairwise preferences. Focus: accuracy. Abstract: This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.

  4. SCOPE: Selective Conformal Optimized Pairwise LLM Judging

    Include a human-eval paper to calibrate against judge-based evaluation settings. Signals: automatic metrics + pairwise preferences. Focus: MT-Bench / error rate. Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.

  5. MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

    Include a human-eval paper to calibrate against judge-based evaluation settings. Signals: automatic metrics + pairwise preferences. Focus: MemoryArena / recall. Abstract: MemoryArena supports evaluation across web navigation, preference-constrained …

  6. Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

    Adds automatic metrics with pairwise preferences for broader protocol coverage within this hub. Signals: automatic metrics + pairwise preferences. Focus: agreement. Abstract: The dataset comprises 436 instances annotated …

  7. Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

    Adds automatic metrics with pairwise preferences for broader protocol coverage within this hub. Signals: automatic metrics + pairwise preferences. Focus: LiveCodeBench / accuracy. Abstract: Pairwise comparisons, by contrast, …

  8. Modeling Distinct Human Interaction in Web Agents

    Adds automatic metrics with pairwise preferences for broader protocol coverage within this hub. Signals: automatic metrics + pairwise preferences. Focus: accuracy. Abstract: Despite rapid progress in autonomous web …

Known Limitations

  • Only 16.7% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (11.1% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot

Human Feedback Mix

  • Pairwise Preference (18)
  • Critique Edit (1)
  • Expert Verification (1)
  • RLAIF or Synthetic Feedback (1)

Evaluation Modes

  • Automatic Metrics (18)

Top Benchmarks

  • ChartEditBench (1)
  • LiveCodeBench (1)
  • LMSYS Chatbot Arena (1)
  • MathBench (1)

Top Metrics

  • Accuracy (4)
  • Agreement (3)
  • Cost (3)
  • Error rate (1)

Rater Population Mix

  • Domain Experts (2)

Quality Controls

  • Inter-Annotator Agreement Reported (2)
  • Adjudication (1)
  • Calibration (1)
Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 22.2% · metrics 66.7% · quality controls 16.7%.
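The coverage diagnostics above reduce to a one-line computation over paper records. A sketch assuming hypothetical field names and illustrative records (not the hub's real export):

```python
def coverage(papers: list, field: str) -> float:
    """Percentage of papers with a non-empty value for one protocol field."""
    return 100.0 * sum(bool(p.get(field)) for p in papers) / len(papers)

# Three illustrative records; field names are assumptions, not the hub schema.
papers = [
    {"metrics": ["accuracy"], "quality_controls": ["iaa"]},
    {"metrics": ["agreement"], "quality_controls": []},
    {"metrics": [], "quality_controls": []},
]

print(round(coverage(papers, "metrics"), 1))           # → 66.7
print(round(coverage(papers, "quality_controls"), 1))  # → 33.3
```

Running this per field over the full corpus reproduces sample-based diagnostics of the kind quoted above.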

Top Papers

  • SCOPE: Selective Conformal Optimized Pairwise LLM Judging

    Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.

  • MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

    Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics Web Browsing

    Existing evaluations of agents with memory typically assess memorization and action in isolation.

  • Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

    Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    This protocol incorporates context-sensitive interpretation and community-informed guidelines and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages.

  • Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

    Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.

  • Same Words, Different Judgments: Modality Effects on Preference Alignment

    Aaron Broukhim, Nadir Weibel, Eshin Jolly · Feb 26, 2026 · Citations: 0

    Pairwise Preference · RLAIF or Synthetic Feedback · Automatic Metrics

    Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored.

  • Multi-Objective Alignment of Language Models for Personalized Psychotherapy

    Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026 · Citations: 0

    Pairwise Preference · Expert Verification · Automatic Metrics

    While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.

  • Modeling Distinct Human Interaction in Web Agents

    Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics Web Browsing

    In this work, we introduce the task of modeling human intervention to support collaborative web task execution.

  • PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training

    Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics Multi Agent

    We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.

  • CAMEL: Confidence-Gated Reflection for Reward Modeling

    Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026 · Citations: 0

    Pairwise Preference · Critique Edit · Automatic Metrics

    Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances.

  • ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

    Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.

  • RLHFless: Serverless Computing for Efficient RLHF

    Rui Wei, Hanfei Yu, Shubham Jain, Yogarajan Sivakumar, Devesh Tiwari · Feb 26, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences.

  • DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

    Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.

  • Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages

    Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages.

  • CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

    Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou · Feb 25, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.

  • Rethinking Metrics for Lexical Semantic Change Detection

    Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and …

  • Investigation for Relative Voice Impression Estimation

    Kenichi Fujita, Yusuke Ijima · Feb 15, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark–Bright").

  • The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems

    Hyo Jin Kim · Feb 25, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.

  • Probing Graph Neural Network Activation Patterns Through Graph Topology

    Floriano Tori, Lorenzo Bini, Marco Sorbi, Stéphane Marchand-Maillet, Vincent Ginis · Feb 24, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    However, it remains unclear how the topology of a graph interacts with the learned preferences of GNNs.
