HFEPX Metric Hub

Coherence + Pairwise Preference Metric Papers

Updated from current HFEPX corpus (Apr 27, 2026). 10 papers are grouped in this metric page.


Common evaluation modes: Automatic Metrics, Human Eval. Common annotation unit: Pairwise. Common metric signal: coherence. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Apr 9, 2026.

Papers: 10 · Last published: Apr 9, 2026

When This Metric Page Is Useful

Context-only for now. This page is not strong enough to justify metric decisions on its own. Quality band: Developing.

Metric Coverage

100.0%

10 sampled papers include metric names.

Benchmark Anchoring

0.0%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

0.0%

0 papers report calibration/adjudication/IAA controls.

None of the 10 papers in this sample is flagged as low-signal.
Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Recommended next step: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Main limitation: Benchmark coverage is still thin, so avoid treating this page as a definitive guide to the metric.

What This Metric Page Tells You

100% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 70% of papers in this hub.
Multi-agent setups appear in 10% of papers (1 of 10), a limited signal of demand for agentic evaluation.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater pools are mostly unspecified, and annotation is commonly pairwise; factor both into replication staffing estimates.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
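One way to put a number on the judge-human comparison suggested above is chance-corrected agreement on the same set of pairwise items. The following is a minimal sketch using Cohen's kappa; the label lists are hypothetical illustrations rather than data from any paper in this hub, and the function name `cohen_kappa` is an assumption of this example.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical pairwise preferences ("A" or "B" wins) on the same 8 comparisons.
human = ["A", "A", "B", "B", "A", "B", "A", "B"]
judge = ["A", "B", "B", "B", "A", "A", "A", "B"]

kappa = cohen_kappa(human, judge)
print(f"raw agreement = 0.75, kappa = {kappa:.2f}")
```

Repeating the calculation on successive model or prompt versions, with the human labels held fixed, gives a rough view of how agreement drifts over time.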

Metric Interpretation

Coherence is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
Accuracy is reported in 50% of hub papers (5/10); compare with a secondary metric before ranking methods.

Start Here (Metric-Reliable First 6)

Ranked by metric-reporting completeness and comparability.

HyperMem: Hypergraph Memory for Long-Term Conversations
Apr 9, 2026 · Citations: 0 · Score: 7.0
Metrics: Accuracy, Coherence · Eval: LLM-as-Judge, Automatic Metrics

Towards Reward Modeling for AI Tutors in Math Mistake Remediation
Mar 25, 2026 · Citations: 0 · Score: 7.0
Metrics: Accuracy, Coherence · Eval: Automatic Metrics

Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization
Apr 24, 2026 · Citations: 0 · Score: 7.0
Metrics: Coherence · Eval: Automatic Metrics

PLOT: Enhancing Preference Learning via Optimal Transport
Apr 2, 2026 · Citations: 0 · Score: 7.0
Metrics: Coherence · Eval: Automatic Metrics

BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents
Mar 25, 2026 · Citations: 0 · Score: 7.0
Metrics: Accuracy, Coherence · Eval: Automatic Metrics

VRM: Teaching Reward Models to Understand Authentic Human Preferences
Mar 5, 2026 · Citations: 0 · Score: 6.5
Metrics: Coherence · Eval: Human Eval

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper	Published	Metrics	Benchmarks	Eval Modes	Quality Controls
HyperMem: Hypergraph Memory for Long-Term Conversations	Apr 9, 2026	Accuracy, Coherence	Not reported	LLM-as-Judge, Automatic Metrics	Not reported
Towards Reward Modeling for AI Tutors in Math Mistake Remediation	Mar 25, 2026	Accuracy, Coherence	Not reported	Automatic Metrics	Not reported
Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization	Apr 24, 2026	Coherence	Not reported	Automatic Metrics	Not reported
PLOT: Enhancing Preference Learning via Optimal Transport	Apr 2, 2026	Coherence	Not reported	Automatic Metrics	Not reported
BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents	Mar 25, 2026	Accuracy, Coherence	Not reported	Automatic Metrics	Not reported
VRM: Teaching Reward Models to Understand Authentic Human Preferences	Mar 5, 2026	Coherence	Not reported	Human Eval	Not reported
PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses	Mar 11, 2026	Accuracy, Spearman	Not reported	Automatic Metrics	Not reported
The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration	Oct 30, 2025	Accuracy, Coherence	Not reported	Automatic Metrics	Not reported
Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation	Apr 8, 2026	Coherence	Not reported	Not reported	Not reported
XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration	May 16, 2025	Coherence	Not reported	Human Eval	Not reported

How To Use This Page

Checklist

Strong: Papers with explicit human feedback
Coverage is strong (100% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (0% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (100% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (0% vs 35% target).

Strong: Papers with known annotation unit
Coverage is strong (50% vs 35% target).
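The checklist labels above are consistent with a simple threshold rule: an item reads "Strong" when coverage meets or exceeds its target and "Gap" otherwise. The sketch below reproduces the six labels under that assumption. The rule is inferred from the numbers shown, not confirmed by the hub, and `coverage_status` is a hypothetical helper name.

```python
def coverage_status(coverage_pct, target_pct):
    """Label a checklist item by comparing coverage to its target (assumed rule)."""
    return "Strong" if coverage_pct >= target_pct else "Gap"


# (checklist item, coverage %, target %) as listed in the checklist above.
checklist = [
    ("Papers with explicit human feedback", 100, 45),
    ("Papers reporting quality controls", 0, 30),
    ("Papers naming benchmarks/datasets", 0, 35),
    ("Papers naming evaluation metrics", 100, 35),
    ("Papers with known rater population", 0, 35),
    ("Papers with known annotation unit", 50, 35),
]

statuses = {name: coverage_status(cov, target) for name, cov, target in checklist}

for name, status in statuses.items():
    print(f"{status}: {name}")
```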

Strengths

Strong human-feedback signal (100% of papers).
Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Benchmark coverage is thin (0% of papers mention benchmarks/datasets).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Track metric sensitivity by reporting both coherence and accuracy.
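One possible way to track the sensitivity mentioned above is to check whether coherence and accuracy rank the same systems in the same order, for example with Spearman rank correlation. The sketch below is pure Python and uses hypothetical per-system scores, not values from any listed paper; the helper names `average_ranks` and `spearman` are assumptions of this example.

```python
def average_ranks(values):
    """1-based ranks in ascending order, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical scores for five systems under each metric.
coherence = [0.82, 0.75, 0.91, 0.60, 0.70]
accuracy = [0.70, 0.72, 0.88, 0.55, 0.65]

rho = spearman(coherence, accuracy)
print(f"Spearman rho = {rho:.2f}")
```

A correlation near 1 suggests the two metrics order systems similarly, while a low or negative value suggests that conclusions depend on which metric is reported.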

Recommended Queries

Judge vs Human Agreement
Metric Slice: coherence
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Coverage Snapshot

Top Metrics

Coherence (10)
Accuracy (5)
Conciseness (1)
Relevance (1)

Evaluation Modes

Automatic Metrics (7)
Human Eval (2)
Llm As Judge (1)

Top Benchmarks

None reported.

Agentic Mix

Multi Agent (1)

Top Papers Reporting This Metric

HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang · Apr 9, 2026 · Citations: 0
LLM-as-Judge · Automatic Metrics · General
Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.

VRM: Teaching Reward Models to Understand Authentic Human Preferences
Biao Liu, Ning Xu, Junming Yang, Hao Xu, Xin Geng · Mar 5, 2026 · Citations: 0
Human Eval · General
Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on…

The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Kotaro Furuya, Yuichi Kitagawa · Oct 30, 2025 · Citations: 0
Automatic Metrics · General
While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition.

XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration
Nuo Chen, Andre Lin HuiKai, Jiaying Wu, Junyi Hou, Zining Zhang · May 16, 2025 · Citations: 0
Human Eval · Coding
To address these scenarios, we propose a human-AI collaboration framework for academic paper revision, centered on criteria-guided intent alignment and context-aware modeling.

Towards Reward Modeling for AI Tutors in Math Mistake Remediation
Kseniia Petukhova, Ekaterina Kochmar · Mar 25, 2026 · Citations: 0
Automatic Metrics · Math
We develop and release Bradley-Terry preference models trained on weighted-sum rankings that we automatically create from MRBench, synthetic pairs, and data combinations.

PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim · Mar 11, 2026 · Citations: 0
Automatic Metrics · Medicine
We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses.

Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization
Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun · Apr 24, 2026 · Citations: 0
Automatic Metrics · General
In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads, attention heads that encode user specific stylistic and topical preferences and exert a causal influence on…

PLOT: Enhancing Preference Learning via Optimal Transport
Liang Zhu, Yuelin Bai, Xiankun Ren, Jiaxi Yang, Lei Zhang · Apr 2, 2026 · Citations: 0
Automatic Metrics · General
Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global…

BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents
Praveen Kumar Myakala, Manan Agrawal, Rahul Manche · Mar 25, 2026 · Citations: 0
Automatic Metrics · General
LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved.

Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation
Zhiyu Cao, Peifeng Li, Qiaoming Zhu · Apr 8, 2026 · Citations: 0
General
Specifically, DRCR employs two complementary feedback signals, discourse coherence and response quality, to construct preference data for both context rewriting and response generation.