HFEPX Hub

CS.LG + Pairwise Preference Papers

Updated from current HFEPX corpus (Feb 27, 2026). 21 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: LiveCodeBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 21 · Last published: Feb 26, 2026

CS.LG · Pairwise Preference

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 21 papers for CS.LG + Pairwise Preference Papers. Dominant protocol signals are automatic metrics, human evaluation, and simulation environments. The most frequent benchmarks are LiveCodeBench and Mathbench, and the most frequent metrics are accuracy and agreement. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Moral Preferences of LLMs Under Directed Contextual Influence; Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences; Probing Graph Neural Network Activation Patterns Through Graph Topology; Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications

Automatic metrics appear in 90.5% of papers in this hub.

Evidence: Moral Preferences of LLMs Under Directed Contextual Influence; Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences; Probing Graph Neural Network Activation Patterns Through Graph Topology; Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications

LiveCodeBench is the most frequently cited benchmark on this page, although it appears in only 1 of 21 papers.

Evidence: Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences; Moral Preferences of LLMs Under Directed Contextual Influence; Probing Graph Neural Network Activation Patterns Through Graph Topology; Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications

Protocol Takeaways

The most common quality-control signal is rater calibration (4.8% of papers).

Evidence: Who can we trust? LLM-as-a-jury for Comparative Assessment; Moral Preferences of LLMs Under Directed Contextual Influence; Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences; Probing Graph Neural Network Activation Patterns Through Graph Topology

Raters are mostly domain experts, and the most common annotation unit is pairwise; use this to scope replication staffing.

Evidence: Simplifying Outcomes of Language Model Component Analyses with ELIA; Multi-Objective Alignment of Language Models for Personalized Psychotherapy; Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications; Moral Preferences of LLMs Under Directed Contextual Influence

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning; Moral Preferences of LLMs Under Directed Contextual Influence; Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences; Probing Graph Neural Network Activation Patterns Through Graph Topology

Benchmark Interpretation

LiveCodeBench appears in 4.8% of hub papers (1/21); use this cohort for benchmark-matched comparisons.
Mathbench appears in 4.8% of hub papers (1/21); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 9.5% of hub papers (2/21); compare with a secondary metric before ranking methods.
agreement is reported in 4.8% of hub papers (1/21); compare with a secondary metric before ranking methods.
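
The percentages quoted for benchmarks and metrics follow from simple counts over the 21 papers in this hub. A minimal sketch of that arithmetic, assuming one-decimal rounding; the helper name `share` is hypothetical and not part of HFEPX:

```python
HUB_SIZE = 21  # number of papers in this hub, per the page header


def share(count: int, total: int = HUB_SIZE) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


print(share(1))   # LiveCodeBench, Mathbench, agreement: 1 of 21 papers
print(share(2))   # accuracy: 2 of 21 papers
print(share(19))  # automatic metrics: 19 of 21 papers
```

With a cohort this small, each single paper moves a share by almost five percentage points, which is why a secondary metric is worth checking before ranking methods.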

Researcher Checklist

Maintain strength on Papers with explicit human feedback. Coverage is strong (100% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (4.8% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (9.5% vs 35% target).
Tighten coverage on Papers naming evaluation metrics. Coverage is usable but incomplete (23.8% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (14.3% vs 35% target).
Maintain strength on Papers with known annotation unit. Coverage is strong (52.4% vs 35% target).
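
The page does not state how a coverage figure is mapped to "strong", "usable but incomplete", or "replication risk". The sketch below assumes one rule that happens to reproduce the six labels above: at or above target is strong, at least half of target is usable, anything lower is a risk. The half-of-target cutoff and the function name `coverage_status` are assumptions for illustration only.

```python
def coverage_status(coverage: float, target: float) -> str:
    """Classify a coverage percentage against its target (assumed rule)."""
    if coverage >= target:
        return "strong"
    if coverage >= 0.5 * target:  # assumed cutoff, not documented by the page
        return "usable but incomplete"
    return "replication risk"


# (checklist item, coverage %, target %) as listed in the checklist above
CHECKLIST = [
    ("explicit human feedback", 100.0, 45.0),
    ("quality controls", 4.8, 30.0),
    ("benchmarks/datasets", 9.5, 35.0),
    ("evaluation metrics", 23.8, 35.0),
    ("known rater population", 14.3, 35.0),
    ("known annotation unit", 52.4, 35.0),
]

for item, coverage, target in CHECKLIST:
    print(f"{item}: {coverage_status(coverage, target)}")
```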

Known Limitations

Only 4.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (14.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: LiveCodeBench - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=0, left_only=1, right_only=19

0 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=19, right_only=1

0 papers use both Automatic Metrics and Simulation Env.

human_eval vs simulation_env

both=0, left_only=1, right_only=1

0 papers use both Human Eval and Simulation Env.
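
The `both`, `left_only`, and `right_only` counts above read as plain set operations, where "left" is the first evaluation mode named in each heading and "right" is the second. A minimal sketch, assuming each mode is stored as a set of paper IDs; the IDs below are hypothetical placeholders, and only the group sizes (1, 19, and 1) come from this page:

```python
def overlap(left: set[str], right: set[str]) -> dict[str, int]:
    """Count papers in both sets, only the left set, and only the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical paper IDs; the sets are disjoint, as the page reports.
human_eval = {"p01"}
automatic_metrics = {f"p{i:02d}" for i in range(2, 21)}  # 19 papers
simulation_env = {"p21"}

print(overlap(human_eval, automatic_metrics))
print(overlap(automatic_metrics, simulation_env))
print(overlap(human_eval, simulation_env))
```

Because the three groups are disjoint and sum to 21, no paper in this hub supports a within-paper comparison of two evaluation modes.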

Benchmark Brief

LiveCodeBench

Coverage: 1 paper (4.8%)

Examples: Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

Benchmark Brief

Mathbench

Coverage: 1 paper (4.8%)

Examples: Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

Benchmark Brief

Retrieval

Coverage: 1 paper (4.8%)

Examples: Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence

Metric Brief

accuracy

Coverage: 2 papers (9.5%)

Examples: Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences; Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

Metric Brief

agreement

Coverage: 1 paper (4.8%)

Examples: Multi-Objective Alignment of Language Models for Personalized Psychotherapy

Metric Brief

calibration

Coverage: 1 paper (4.8%)

Examples: Who can we trust? LLM-as-a-jury for Comparative Assessment

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Moral Preferences of LLMs Under Directed Contextual Influence; Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences; Probing Graph Neural Network Activation Patterns Through Graph Topology

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

Moral Preferences of LLMs Under Directed Contextual Influence
Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie · Feb 26, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences.
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
Probing Graph Neural Network Activation Patterns Through Graph Topology
Floriano Tori, Lorenzo Bini, Marco Sorbi, Stéphane Marchand-Maillet, Vincent Ginis · Feb 24, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

However, it remains unclear how the topology of a graph interacts with the learned preferences of GNNs.
Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
Simplifying Outcomes of Language Model Component Analyses with ELIA
Aaron Louis Eidt, Nils Feldhus · Feb 20, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations.
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
Who can we trust? LLM-as-a-jury for Comparative Assessment
Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwise comparative judgements.
Learning Personalized Agents from Human Feedback
Kaiqu Liang, Julia Kruk, Shengyi Qian, Xianjun Yang, Shengjie Bi · Feb 18, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users.
Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026 · Citations: 0

Pairwise Preference Expert Verification Automatic Metrics

While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and informati
Cold-Start Personalization via Training-Free Priors from Structured World Models
Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du · Feb 16, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available.
Investigation for Relative Voice Impression Estimation
Kenichi Fujita, Yusuke Ijima · Feb 15, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark-Bright").
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0

Pairwise Preference Simulation Env Long Horizon

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System
Anantha Sharma · Jan 3, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Detecting distributional drift in high-dimensional data streams presents fundamental challenges: global comparison methods scale poorly, projection-based approaches lose geometric structure, and re-clustering methods suffer from identity in
Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan · Oct 27, 2025 · Citations: 0

Pairwise Preference Human Eval

Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation.
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu · Oct 14, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference.
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications
Zhanliang Wang, Da Wu, Quan Nguyen, Zhuoran Xu, Kai Wang · May 9, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through preference optimiz
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda · Apr 3, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Using the BatchTopK crosscoder, we successfully identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as "false information" and "personal question", along w
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen · Feb 24, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval.
Efficient Context Propagating Perceiver Architectures for Auto-Regressive Language Modeling
Kaleel Mahmood, Shaoyi Huang · Dec 8, 2024 · Citations: 0

Pairwise Preference Automatic Metrics

One of the key challenges in Transformer architectures is the quadratic complexity of the attention mechanism, which limits the efficient processing of long sequences.
