HFEPX Hub

Multi Agent + Pairwise Preference Papers

Updated from current HFEPX corpus (Mar 10, 2026). 10 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Mar 10, 2026). 10 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequently cited benchmark: AlpacaEval. Common metric signal: elo. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 14, 2026.

Papers: 10 Last published: Feb 14, 2026 Global RSS Tag RSS

Multi AgentPairwise Preference

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (10) Replication-Ready Only (1)

High-Signal Coverage

100.0%

10 / 10 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

1 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 30% of papers in this hub.
AlpacaEval is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Benchmark Interpretation

AlpacaEval appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
AlpacaEval 2.0 appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

elo is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.
helpfulness is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Strong: Papers with explicit human feedback

Coverage is strong (100% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (10% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (30% vs 35% target).
Moderate: Papers with known rater population

Coverage is usable but incomplete (30% vs 35% target).
Strong: Papers with known annotation unit

Coverage is strong (70% vs 35% target).

Strengths

Strong human-feedback signal (100% of papers).
Agentic evaluation appears in 100% of papers.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Benchmark coverage is thin (10% of papers mention benchmarks/datasets).
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (AlpacaEval vs AlpacaEval 2.0) before comparing methods.
Track metric sensitivity by reporting both elo and helpfulness.

Recommended Queries (Expanded)

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: AlpacaEval Metric Slice: elo Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

Highest protocol score with explicit human/eval signal plus MT-Bench.

Strongest benchmark reference

PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agen…

Reported benchmark with helpfulness gives a fast comparison anchor.

Strongest recent paper

Build, Judge, Optimize: A Blueprint for Continuous Improvement of Mul…

Useful for current practice scanning; published Mar 3, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Feb 14, 2026 · Citations: 0 · Score: 8.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: MT Bench · Metric: Elo
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Feb 14, 2026 · Citations: 0 · Score: 6.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Helpfulness
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Mar 3, 2026 · Citations: 0 · Score: 4.5

HF: Pairwise Preference, Rubric Rating · Eval: Llm As Judge, Simulation Env · Benchmark: Not Reported · Metric: Not Reported
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Apr 26, 2025 · Citations: 0 · Score: 4.5

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Hit@5
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Feb 16, 2026 · Citations: 0 · Score: 4.5

HF: Pairwise Preference, Rubric Rating · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported
Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
Mar 2, 2026 · Citations: 0 · Score: 4.5

HF: Pairwise Preference · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment Feb 14, 2026	Yes Pairwise Preference	Automatic Metrics	MT Bench , AlpacaEval	Elo	Not Reported
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training Feb 14, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Helpfulness	Not Reported
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants Mar 3, 2026	Yes Pairwise Preference , Rubric Rating	Llm As Judge , Simulation Env	Not Reported	Not Reported	Not Reported
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition Apr 26, 2025	Yes Pairwise Preference	Automatic Metrics	Not Reported	Hit@5	Not Reported
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation Feb 16, 2026	Yes Pairwise Preference , Rubric Rating	Not Reported	Not Reported	Not Reported	Not Reported
Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning Mar 2, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Toward Expert Investment Teams:A Multi-Agent LLM System with Fine-Grained Trading Tasks Feb 26, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus Feb 26, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems Feb 17, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures Aug 16, 2025	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Elo-Evolve: A Co-evolutionary Framework for Languag…	PrivAct: Internalizing Contextual Privacy Preservat…	Build, Judge, Optimize: A Blueprint for Continuous…
Human Feedback	Pairwise Preference	Pairwise Preference	Pairwise Preference, Rubric Rating
Evaluation Modes	Automatic Metrics	Automatic Metrics	Llm As Judge, Simulation Env
Benchmarks	MT Bench, AlpacaEval	Not reported	Not reported
Metrics	Elo	Helpfulness	Not reported
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Pairwise	Unknown	Multi Dim Rubric

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (10)
Rubric Rating (2)
Expert Verification (1)

Evaluation Modes

Automatic Metrics (3)
Llm As Judge (1)
Simulation Env (1)

Top Benchmarks

AlpacaEval (1)
AlpacaEval 2.0 (1)
MT Bench (1)

Top Metrics

Elo (1)
Helpfulness (1)
Hit@5 (1)

Rater Population Mix

Domain Experts (3)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 10.0% · metrics 30.0% · quality controls 0.0%.

Top Papers

Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Jing Zhao, Ting Zhen, Junwei Bao, Hongfei Jiang, Yang Song · Feb 14, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability.
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Llm As JudgeSimulation Env Long Horizon

Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Multi Agent

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher · Mar 2, 2026 · Citations: 0

Pairwise Preference Multi Agent

When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI…
Toward Expert Investment Teams:A Multi-Agent LLM System with Fine-Grained Trading Tasks
Kunihiro Miyazaki, Takanobu Kawahara, Stephen Roberts, Stefan Zohren · Feb 26, 2026 · Citations: 0

Pairwise Preference Multi Agent

While mainstream approaches deploy multi-agent systems mimicking analyst and manager roles, they often rely on abstract instructions that overlook the intricacies of real-world workflows, which can lead to degraded inference performance and…
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus
Anna Van Elst, Kerrian Le Caillec, Igor Colin, Stephan Clémençon · Feb 26, 2026 · Citations: 0

Pairwise Preference Multi Agent

The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical…
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0

Pairwise Preference Multi Agent

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and…
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0

Pairwise Preference Multi Agent

Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.

Related Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote