HFEPX Hub

Automatic Metrics + General + Pairwise Preference Papers

Updated from current HFEPX corpus (Apr 12, 2026). 59 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 59 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: MT-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 13, 2026.

Papers: 59 Last published: Feb 13, 2026 Global RSS Tag RSS

Automatic MetricsGeneralPairwise Preference

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Medium .

All Sampled Papers (59) Replication-Ready Only (11)

High-Signal Coverage

100.0%

59 / 59 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

11 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
5 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 100% of papers in this hub.
MT-Bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Most common quality-control signal is inter-annotator agreement reporting (6.8% of papers).
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Benchmark Interpretation

MT-Bench appears in 5.1% of hub papers (3/59); use this cohort for benchmark-matched comparisons.
AlpacaEval appears in 3.4% of hub papers (2/59); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 44.1% of hub papers (26/59); compare with a secondary metric before ranking methods.
cost is reported in 13.6% of hub papers (8/59); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Strong: Papers with explicit human feedback

Coverage is strong (100% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (8.5% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (18.6% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (89.8% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (6.8% vs 35% target).
Strong: Papers with known annotation unit

Coverage is strong (47.5% vs 35% target).

Strengths

Strong human-feedback signal (100% of papers).
Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 8.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6.8% coverage).
Benchmark coverage is thin (18.6% of papers mention benchmarks/datasets).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (MT-Bench vs AlpacaEval) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries (Expanded)

Recommended Queries

Judge vs Human Agreement Benchmark Slice: MT-Bench Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Highest protocol score with explicit human/eval signal plus MT-Bench.

Strongest benchmark reference

Personalized RewardBench: Evaluating Reward Models with Human Aligned…

Rewardbench with accuracy gives a fast comparison anchor.

Strongest recent paper

ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrat…

Useful for current practice scanning; published Mar 27, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Feb 13, 2026 · Citations: 0 · Score: 9.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: MT Bench · Metric: Error rate
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Apr 8, 2026 · Citations: 0 · Score: 8.0

HF: Pairwise Preference, Rubric Rating · Eval: Human Eval, Automatic Metrics · Benchmark: Rewardbench · Metric: Accuracy
ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims
Mar 27, 2026 · Citations: 0 · Score: 8.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Codabench · Metric: Recall
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
Mar 23, 2026 · Citations: 0 · Score: 8.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: MT Bench · Metric: Accuracy
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Feb 14, 2026 · Citations: 0 · Score: 7.5

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: MT Bench · Metric: Elo
PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning
Jan 17, 2026 · Citations: 0 · Score: 7.5

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Calconflictbench · Metric: Error rate

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
SCOPE: Selective Conformal Optimized Pairwise LLM Judging Feb 13, 2026	Yes Pairwise Preference	Automatic Metrics	MT Bench , LMSYS Chatbot Arena	Error rate	Calibration
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization Apr 8, 2026	Yes Pairwise Preference , Rubric Rating	Human Eval , Automatic Metrics	Rewardbench	Accuracy , Helpfulness	Not Reported
ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims Mar 27, 2026	Yes Pairwise Preference	Automatic Metrics	Codabench	Recall , Recall@k	Not Reported
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment Mar 23, 2026	Yes Pairwise Preference	Automatic Metrics	MT Bench , AlpacaEval	Accuracy	Not Reported
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment Feb 14, 2026	Yes Pairwise Preference	Automatic Metrics	MT Bench , AlpacaEval	Elo	Not Reported
PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning Jan 17, 2026	Yes Pairwise Preference	Automatic Metrics	Calconflictbench	Error rate	Not Reported
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks Feb 18, 2026	Yes Pairwise Preference	Automatic Metrics	Memoryarena	Recall	Not Reported
From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories Mar 30, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Kappa , Agreement	Inter Annotator Agreement Reported
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation Mar 20, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Kappa , Faithfulness	Inter Annotator Agreement Reported
AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning Mar 4, 2026	Yes Pairwise Preference	Automatic Metrics	Semeval	Accuracy	Not Reported
Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language Feb 21, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Agreement	Inter Annotator Agreement Reported , Adjudication
Same Words, Different Judgments: Modality Effects on Preference Alignment Feb 26, 2026	Yes Pairwise Preference , Rlaif Or Synthetic Feedback	Automatic Metrics	Not Reported	Agreement	Inter Annotator Agreement Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	SCOPE: Selective Conformal Optimized Pairwise LLM J…	Personalized RewardBench: Evaluating Reward Models…	ClimateCheck 2026: Scientific Fact-Checking and Dis…
Human Feedback	Pairwise Preference	Pairwise Preference, Rubric Rating	Pairwise Preference
Evaluation Modes	Automatic Metrics	Human Eval, Automatic Metrics	Automatic Metrics
Benchmarks	MT Bench, LMSYS Chatbot Arena	Rewardbench	Codabench
Metrics	Error rate	Accuracy, Helpfulness	Recall, Recall@k
Quality Controls	Calibration	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Pairwise	Pairwise	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (59)
Critique Edit (2)
Rubric Rating (2)
Demonstrations (1)

Evaluation Modes

Automatic Metrics (59)
Human Eval (1)
Llm As Judge (1)

Top Benchmarks

MT Bench (3)
AlpacaEval (2)
Rewardbench (2)
AlpacaEval 2.0 (1)

Top Metrics

Accuracy (26)
Cost (8)
Relevance (6)
Coherence (4)

Rater Population Mix

Domain Experts (4)

Quality Controls

Inter Annotator Agreement Reported (4)
Adjudication (1)
Calibration (1)

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 18.6% · metrics 89.8% · quality controls 8.5%.

Top Papers

SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Human EvalAutomatic Metrics

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Jing Zhao, Ting Zhen, Junwei Bao, Hongfei Jiang, Yang Song · Feb 14, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability.
PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning
Bingxuan Li, Jeonghwan Kim, Cheng Qian, Xiusi Chen, Eitan Anzenberg · Jan 17, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

To enable a systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution.
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Web Browsing

Existing evaluations of agents with memory typically assess memorization and action in isolation.
From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories
Daban Q. Jaff · Mar 30, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

After assembling model outputs, we introduce an agreement-based stability taxonomy (ABC) to stratify inter-model output stability.
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Richard J. Young · Mar 20, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Three classifiers (a regex-only detector, a regex-plus-LLM pipeline, and a Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters.
Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

This protocol incorporates context-sensitive interpretation and community-informed guidelines and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages.
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

We additionally contribute a CAD dataset with human preference annotations.
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang · Apr 9, 2026 · Citations: 0

Pairwise Preference Llm As JudgeAutomatic Metrics

Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
Signals: Trajectory Sampling and Triage for Agentic Interactions
Shuguang Chen, Adil Hafeez, Salman Paracha · Apr 1, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

We propose a lightweight, signal-based framework for triaging agentic interaction trajectories.
Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure
Davide Di Gioia · Mar 23, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

Autonomous agents operating in continuous environments must decide not only what to do, but when to act.
The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Kotaro Furuya, Yuichi Kitagawa · Oct 30, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition.
How Reliable is Language Model Micro-Benchmarking?
Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta · Oct 9, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark.
Same Words, Different Judgments: Modality Effects on Preference Alignment
Aaron Broukhim, Nadir Weibel, Eshin Jolly · Feb 26, 2026 · Citations: 0

Pairwise PreferenceRlaif Or Synthetic Feedback Automatic Metrics

Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored.
Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE
Hejin Huang, Jusheng Zhang, Kaitong Cai, Jian Wang, Rong Pan · Mar 31, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in large language models to emerging applications in recommender systems.
PrefDisco: Benchmarking Proactive Personalized Reasoning
Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh · Sep 30, 2025 · Citations: 0

Pairwise PreferenceRubric Rating Automatic Metrics

We introduce PrefDisco, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse, context-dependent preferences, and define PrefAlign as a…
ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims
Raia Abu Ahmad, Max Upravitelev, Aida Usmanova, Veronika Solopova, Georg Rehm · Mar 27, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems.
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith · Mar 23, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility.
AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning
Nikolas Karafyllis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou · Mar 4, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

We present a winning three-stage system for SemEval 2026 Task~12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design optimized through reflective prompt evolution, and post-hoc…
Modeling Distinct Human Interaction in Web Agents
Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Web Browsing

In this work, we introduce the task of modeling human intervention to support collaborative web task execution.
HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
Jiajun Zhang, Shijia Luo, Ruikang Zhang, Qi Su · Nov 21, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation.
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

We introduce BEAT, the first framework to inject such visual backdoors into VLM-based embodied agents using objects in the environments as triggers.
Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
Shaina Raza, Rizwan Qureshi, Azib Farooq, Marcelo Lotif, Aman Chadha · May 23, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Unlike post-hoc filtering or preference-based alignment, immunization introduces direct negative supervision on labeled falsehoods.
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai · May 21, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

In this paper, we introduce VerifyBench and its challenging variant VerifyBench-Hard, two benchmarks specifically designed to assess reference-based reward systems.
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou · Jan 28, 2025 · Citations: 0

Pairwise PreferenceDemonstrations Automatic Metrics Web Browsing

We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency.
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang · Apr 7, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.
Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu · Apr 2, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process.
Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
Simona-Vasilica Oprea, Adela Bâra · Apr 1, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Using the Anthropic HHRLHF dataset, we evaluate ten diverse large language models LLMs under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task.
MemRerank: Preference Memory for Personalized Product Reranking
Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong · Mar 31, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch.
Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models
Junhyeok Lee, Kyu Sung Choi · Mar 28, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

FARE reveals that routing-level preference shifts are either unachievable (Mixtral, Qwen1.5, Qwen3), statistically non-robust (DeepSeekMoE), or accompanied by substantial utility cost (OLMoE, -4.4%p CrowS-Pairs at -6.3%p TQA).
A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems
Aram Abrahamyan, Sachin Kumar · Mar 19, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence.
CAMEL: Confidence-Gated Reflection for Reward Modeling
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026 · Citations: 0

Pairwise PreferenceCritique Edit Automatic Metrics

Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances.
Learning Ordinal Probabilistic Reward from Preferences
Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo · Feb 13, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Reward models are crucial for aligning large language models (LLMs) with human values and intentions.
Should LLMs, like, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial
Jio Oh, Paul Vicinanza, Thomas Butler, Steven Euijong Whang, Dezhi Hong · Jan 30, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Independent evaluations confirm data quality, with annotators preferring MDial outputs over prior methods in 98% of pairwise comparisons for dialect naturalness.
Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, and LLaMA
Hanyu Cai, Binqi Shen, Lier Jin, Lan Hu, Xiaojing Fan · Dec 14, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind),…
PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun · Oct 5, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture.
PLOT: Enhancing Preference Learning via Optimal Transport
Liang Zhu, Yuelin Bai, Xiankun Ren, Jiaxi Yang, Lei Zhang · Apr 2, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global…
ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models
Delip Rao, Feijiang Han, Chris Callison-Burch · Apr 2, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

By contrast, zero-shot chain-of-thought on the base Gemma3-1B harms accuracy relative to direct answers, and preference optimization with a simple format+accuracy reward underperforms supervised reasoning.
OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang · Mar 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement.
BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents
Praveen Kumar Myakala, Manan Agrawal, Rahul Manche · Mar 25, 2026 · Citations: 0

Pairwise PreferenceCritique Edit Automatic Metrics

LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved.
LLMs Can Infer Political Alignment from Online Conversations
Byunghwee Lee, Sangyeon Kim, Filippo Menczer, Yong-Yeol Ahn, Haewoon Kwak · Mar 11, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Due to the correlational structure in our traits such as identities, cultures, and political attitudes, seemingly innocuous preferences such as following a band or using a specific slang, can reveal private traits.
Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda · Mar 7, 2026 · Citations: 0

Pairwise PreferenceRed Team Automatic Metrics

Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale,…
When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger
Amirabbas Afzali, Myeongho Jeon, Maria Brbic · Mar 5, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization…
LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services
Jinwen Chen, Shuai Gong, Shiwen Zhang, Zheng Zhang, Yachao Zhao · Mar 5, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency.
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani · Mar 3, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs.
RLHFless: Serverless Computing for Efficient RLHF
Rui Wei, Hanfei Yu, Shubham Jain, Yogarajan Sivakumar, Devesh Tiwari · Feb 26, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences.
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

This ``one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.
PACIFIC: Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs
Tianyu Zhao, Siqi Li, Yasser Shoukry, Salma Elmalaki · Feb 6, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Based on these findings, we introduce PACIFIC (Preference Alignment Choices Inference for Five-factor Identity Characterization), a personality-labeled preference dataset containing 1200 preference statements spanning diverse domains (e.g.,…
Robust Preference Alignment via Directional Neighborhood Consensus
Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei · Oct 23, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method by leveraging directional neighborhood consensus.
Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization
Yuto Tomikawa, Masaki Uto · Oct 22, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

To address these limitations, this study proposes a novel difficulty-controllable multiple-choice question generation method for reading comprehension which leverages a large language model trained using a direct preference optimization…
Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability
Taylor Sorensen, Benjamin Newman, Jared Moore, Chan Park, Jillian Fisher · Oct 7, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

On many tasks such as creative writing, synthetic data generation, or steering to diverse preferences, models must cover an entire distribution of outputs, rather than a single correct answer.
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen · Feb 24, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval.
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.
Investigation for Relative Voice Impression Estimation
Kenichi Fujita, Yusuke Ijima · Feb 15, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., ``Dark--Bright'').
The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems
Hyo Jin Kim · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.
Probing Graph Neural Network Activation Patterns Through Graph Topology
Floriano Tori, Lorenzo Bini, Marco Sorbi, Stéphane Marchand-Maillet, Vincent Ginis · Feb 24, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

However, it remains unclear how the topology of a graph interacts with the learned preferences of GNNs.
Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation
Xiaomeng Yang, Mengping Yang, Jia Gong, Luozheng Qin, Zhiyu Tan · Feb 4, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

However, they usually fail to produce satisfactory outputs that are aligned to users' authentic demands and preferences.

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now