
HFEPX Hub

Automatic Metrics + Pairwise Preference (Last 30 Days)

Updated from the current HFEPX corpus (Mar 1, 2026). 18 papers are grouped on this hub page.


Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Inter-Annotator Agreement Reported. Frequently cited benchmark: ChartEditBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 18 · Last published: Feb 26, 2026
Tags: Automatic Metrics · Pairwise Preference · Last 30d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%

18 / 18 sampled papers are not flagged as low-signal.

Replication-Ready Set

3

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

0

Papers containing both `human_eval` and `llm_as_judge`.

  • 3 papers are replication-ready (benchmark + metric + explicit evaluation mode).
  • 0 papers support judge-vs-human agreement analysis.
  • 3 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.
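The replication-ready cut above (benchmark + metric + explicit evaluation mode) is easy to reproduce over abstract-level metadata. A minimal sketch, assuming a hypothetical record schema with `benchmarks`, `metrics`, and `eval_modes` fields (not the hub's actual export format):

```python
# Hypothetical triage filter over abstract-level paper metadata.
# The field names below are assumptions, not the hub's real schema.

def is_replication_ready(paper: dict) -> bool:
    """True when the paper names a benchmark, a metric, and an eval mode."""
    return all(paper.get(key) for key in ("benchmarks", "metrics", "eval_modes"))

papers = [
    {"title": "SCOPE", "benchmarks": ["MT-Bench"], "metrics": ["error rate"],
     "eval_modes": ["automatic_metrics"]},
    {"title": "PrivAct", "benchmarks": [], "metrics": ["helpfulness"],
     "eval_modes": ["automatic_metrics"]},
]

ready = [p["title"] for p in papers if is_replication_ready(p)]
print(ready)  # → ['SCOPE']
```

Only papers with all three anchors survive the filter; this matches how the "Replication-Ready Set" count of 3 is defined above.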


Why This Matters For Eval Research

  • 100% of papers report explicit human-feedback signals, led by pairwise preferences.
  • automatic metrics appears in 100% of papers in this hub.
  • ChartEditBench serves as a benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

  • Most common quality-control signal is inter-annotator agreement reporting (11.1% of papers).
  • Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
  • Stratify by benchmark (ChartEditBench vs. LiveCodeBench) before comparing methods.
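Since inter-annotator agreement reporting is the most common quality-control signal in this hub, a chance-corrected statistic such as Cohen's kappa is the usual way to summarize how two raters' pairwise labels align. A self-contained sketch with purely illustrative labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labeled independently at their base rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two raters judging the same pairwise comparisons ("A" = first response wins).
rater_1 = ["A", "A", "B", "A", "B", "B", "A", "B"]
rater_2 = ["A", "A", "B", "B", "B", "B", "A", "A"]
print(cohens_kappa(rater_1, rater_2))  # → 0.5
```

Kappa of 1.0 is perfect agreement and 0 is chance-level; reporting it alongside raw percent agreement makes replication staffing decisions easier to defend.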

Benchmark Interpretation

  • ChartEditBench appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.
  • LiveCodeBench appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 22.2% of hub papers (4/18); compare with a secondary metric before ranking methods.
  • agreement is reported in 16.7% of hub papers (3/18); compare with a secondary metric before ranking methods.
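One way to act on "compare with a secondary metric before ranking methods" is to rank the same methods under both metrics and flag when the orderings diverge. A sketch with purely illustrative numbers (not taken from any paper in this hub):

```python
def rank_by(scores: dict, metric: str) -> list:
    """Method names ordered best-first under a single metric."""
    return sorted(scores, key=lambda m: scores[m][metric], reverse=True)

# Illustrative per-method results; the names and values are made up.
scores = {
    "method_x": {"accuracy": 0.81, "agreement": 0.64},
    "method_y": {"accuracy": 0.79, "agreement": 0.71},
    "method_z": {"accuracy": 0.74, "agreement": 0.58},
}

primary = rank_by(scores, "accuracy")     # → ['method_x', 'method_y', 'method_z']
secondary = rank_by(scores, "agreement")  # → ['method_y', 'method_x', 'method_z']
if primary != secondary:
    print("ranking is metric-sensitive; report both metrics")
```

When the two orderings disagree, a single-metric leaderboard claim is fragile and both metrics should be reported.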

Researcher Checklist

  • Strong: Papers with explicit human feedback

    Coverage is strong (100% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (16.7% vs 30% target).

  • Moderate: Papers naming benchmarks/datasets

    Coverage is usable but incomplete (22.2% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (66.7% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (11.1% vs 35% target).

  • Strong: Papers with known annotation unit

    Coverage is strong (38.9% vs 35% target).

Strengths

  • Strong human-feedback signal (100% of papers).

Known Gaps

  • Only 16.7% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (11.1% coverage).

Suggested Next Analyses

  • Stratify by benchmark (ChartEditBench vs. LiveCodeBench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and agreement.

Recommended Queries

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).
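The protocol-completeness ranking described here can be approximated with a simple additive score. A sketch, assuming hypothetical metadata field names (not the hub's actual schema):

```python
def completeness_score(paper: dict) -> int:
    """One point per protocol ingredient explicitly present in the metadata."""
    return sum([
        bool(paper.get("human_feedback")),                            # human signal
        bool(paper.get("benchmarks")),                                # benchmark anchor
        bool(paper.get("metrics")),                                   # metric anchor
        bool(paper.get("quality_controls")),                          # QC evidence
        bool(paper.get("human_eval") and paper.get("llm_as_judge")),  # judge/human overlap
    ])

papers = [
    {"title": "paper_a", "human_feedback": True, "benchmarks": ["B1"],
     "metrics": ["accuracy"], "quality_controls": ["calibration"]},
    {"title": "paper_b", "human_feedback": True, "metrics": ["cost"]},
]

ranked = sorted(papers, key=completeness_score, reverse=True)
print([p["title"] for p in ranked])  # → ['paper_a', 'paper_b']
```

An additive score keeps the ranking transparent: each point corresponds to one checkable claim in the abstract, so ties are easy to audit by hand.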

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

  • SCOPE: Selective Conformal Optimized Pairwise LLM Judging (Feb 13, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: MT-Bench, LMSYS Chatbot Arena · Metrics: Error rate · QC: Calibration
  • MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks (Feb 18, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: MemoryArena · Metrics: Recall · QC: Not Reported
  • Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences (Feb 25, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: LiveCodeBench, MathBench · Metrics: Accuracy · QC: Not Reported
  • Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language (Feb 21, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Agreement · QC: Inter-Annotator Agreement Reported, Adjudication
  • Same Words, Different Judgments: Modality Effects on Preference Alignment (Feb 26, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Agreement · QC: Inter-Annotator Agreement Reported
  • ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models (Feb 17, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: ChartEditBench · Metrics: Not Reported · QC: Not Reported
  • Multi-Objective Alignment of Language Models for Personalized Psychotherapy (Feb 17, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Agreement, Cost · QC: Not Reported
  • Modeling Distinct Human Interaction in Web Agents (Feb 19, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Accuracy · QC: Not Reported
  • PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training (Feb 14, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Helpfulness · QC: Not Reported
  • CAMEL: Confidence-Gated Reflection for Reward Modeling (Feb 24, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Accuracy, Cost · QC: Not Reported
  • RLHFless: Serverless Computing for Efficient RLHF (Feb 26, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Cost · QC: Not Reported
  • DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs (Feb 25, 2026) · HF signal: Yes · Eval modes: Automatic Metrics · Benchmarks: Not Reported · Metrics: Accuracy · QC: Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

  • SCOPE: Selective Conformal Optimized Pairwise LLM Judging · Human feedback: Pairwise Preference · Eval modes: Automatic Metrics · Benchmarks: MT-Bench, LMSYS Chatbot Arena · Metrics: Error rate · QC: Calibration · Rater population: Unknown · Annotation unit: Pairwise
  • MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks · Human feedback: Pairwise Preference · Eval modes: Automatic Metrics · Benchmarks: MemoryArena · Metrics: Recall · QC: Not reported · Rater population: Unknown · Annotation unit: Unknown
  • Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences · Human feedback: Pairwise Preference · Eval modes: Automatic Metrics · Benchmarks: LiveCodeBench, MathBench · Metrics: Accuracy · QC: Not reported · Rater population: Unknown · Annotation unit: Pairwise

Suggested Reading Order

  1. RLHFless: Serverless Computing for Efficient RLHF

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics + pairwise preferences. Focus: cost. Abstract: Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences.

  2. Same Words, Different Judgments: Modality Effects on Preference Alignment

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics + pairwise preferences. Focus: agreement. Abstract: Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences.

  3. DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics + pairwise preferences. Focus: accuracy. Abstract: This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.

  4. SCOPE: Selective Conformal Optimized Pairwise LLM Judging

    Include a human-eval paper to calibrate against judge-based evaluation settings. Signals: automatic metrics + pairwise preferences. Focus: MT-Bench / error rate. Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.

  5. MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

    Include a human-eval paper to calibrate against judge-based evaluation settings. Signals: automatic metrics + pairwise preferences. Focus: MemoryArena / recall. Abstract: MemoryArena supports evaluation across web navigation, preference-constrained …

  6. Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

    Adds automatic metrics with pairwise preferences for broader protocol coverage within this hub. Signals: automatic metrics + pairwise preferences. Focus: agreement. Abstract: The dataset comprises 436 instances annotated …

  7. Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

    Adds automatic metrics with pairwise preferences for broader protocol coverage within this hub. Signals: automatic metrics + pairwise preferences. Focus: LiveCodeBench / accuracy. Abstract: Pairwise comparisons, by contrast, …

  8. Modeling Distinct Human Interaction in Web Agents

    Adds automatic metrics with pairwise preferences for broader protocol coverage within this hub. Signals: automatic metrics + pairwise preferences. Focus: accuracy. Abstract: Despite rapid progress in autonomous web …

Known Limitations

  • Only 16.7% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (11.1% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot

Human Feedback Mix

  • Pairwise Preference (18)
  • Critique Edit (1)
  • Expert Verification (1)
  • RLAIF or Synthetic Feedback (1)

Evaluation Modes

  • Automatic Metrics (18)

Top Benchmarks

  • ChartEditBench (1)
  • LiveCodeBench (1)
  • LMSYS Chatbot Arena (1)
  • MathBench (1)

Top Metrics

  • Accuracy (4)
  • Agreement (3)
  • Cost (3)
  • Error rate (1)

Rater Population Mix

  • Domain Experts (2)

Quality Controls

  • Inter-Annotator Agreement Reported (2)
  • Adjudication (1)
  • Calibration (1)
Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 22.2% · metrics 66.7% · quality controls 16.7%.
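The coverage diagnostics above reduce to a one-line computation over paper records. A sketch assuming hypothetical field names and illustrative records (not the hub's real export):

```python
def coverage(papers: list, field: str) -> float:
    """Percentage of papers with a non-empty value for one protocol field."""
    return 100.0 * sum(bool(p.get(field)) for p in papers) / len(papers)

# Three illustrative records; field names are assumptions, not the hub schema.
papers = [
    {"metrics": ["accuracy"], "quality_controls": ["iaa"]},
    {"metrics": ["agreement"], "quality_controls": []},
    {"metrics": [], "quality_controls": []},
]

print(round(coverage(papers, "metrics"), 1))           # → 66.7
print(round(coverage(papers, "quality_controls"), 1))  # → 33.3
```

Running this per field over the full corpus reproduces sample-based diagnostics of the kind quoted above.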

Top Papers

  • SCOPE: Selective Conformal Optimized Pairwise LLM Judging

    Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.

  • MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

    Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics Web Browsing

    Existing evaluations of agents with memory typically assess memorization and action in isolation.

  • Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

    Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    This protocol incorporates context-sensitive interpretation and community-informed guidelines and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages.

  • Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

    Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.

  • Same Words, Different Judgments: Modality Effects on Preference Alignment

    Aaron Broukhim, Nadir Weibel, Eshin Jolly · Feb 26, 2026 · Citations: 0

    Pairwise Preference · RLAIF or Synthetic Feedback · Automatic Metrics

    Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored.

  • Multi-Objective Alignment of Language Models for Personalized Psychotherapy

    Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026 · Citations: 0

    Pairwise Preference · Expert Verification · Automatic Metrics

    While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.

  • Modeling Distinct Human Interaction in Web Agents

    Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics Web Browsing

    In this work, we introduce the task of modeling human intervention to support collaborative web task execution.

  • PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training

    Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics Multi Agent

    We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.

  • CAMEL: Confidence-Gated Reflection for Reward Modeling

    Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026 · Citations: 0

    Pairwise Preference · Critique Edit · Automatic Metrics

    Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances.

  • ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

    Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.

  • RLHFless: Serverless Computing for Efficient RLHF

    Rui Wei, Hanfei Yu, Shubham Jain, Yogarajan Sivakumar, Devesh Tiwari · Feb 26, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences.

  • DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

    Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.

  • Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages

    Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages.

  • CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

    Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou · Feb 25, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.

  • Rethinking Metrics for Lexical Semantic Change Detection

    Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and …

  • Investigation for Relative Voice Impression Estimation

    Kenichi Fujita, Yusuke Ijima · Feb 15, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark–Bright").

  • The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems

    Hyo Jin Kim · Feb 25, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.

  • Probing Graph Neural Network Activation Patterns Through Graph Topology

    Floriano Tori, Lorenzo Bini, Marco Sorbi, Stéphane Marchand-Maillet, Vincent Ginis · Feb 24, 2026 · Citations: 0

    Pairwise Preference Automatic Metrics

    However, it remains unclear how the topology of a graph interacts with the learned preferences of GNNs.
