HFEPX Metric Hub

Coherence + General Metric Papers

Updated from current HFEPX corpus (Apr 12, 2026). 15 papers are grouped in this metric page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 15 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Llm As Judge. Common annotation unit: Trajectory. Frequently cited benchmark: ALFWorld. Common metric signal: coherence. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Apr 9, 2026.

Papers: 15 Last published: Apr 9, 2026 Global RSS

When This Metric Page Is Useful

Useful for background comparison, but still validate benchmark and protocol details in the linked papers. Quality band: Medium .

Metric Coverage

93.3%

14 sampled papers include metric names.

Benchmark Anchoring

26.7%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

0.0%

0 papers report calibration/adjudication/IAA controls.

15 papers are not low-signal flagged in this sample.
Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Recommended next step: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Main limitation: Benchmark coverage is still thin, so avoid treating this page as a definitive guide to the metric.

What This Metric Page Tells You

40% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 60% of papers in this hub.
ALFWorld is a recurring benchmark anchor for cross-paper comparisons in this page.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly unspecified rater pools, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Metric Interpretation

coherence is reported in 100% of hub papers (15/15); compare with a secondary metric before ranking methods.
accuracy is reported in 20% of hub papers (3/15); compare with a secondary metric before ranking methods.

Benchmark Context

ALFWorld appears in 6.7% of hub papers (1/15); use this cohort for benchmark-matched comparisons.
MLE-Bench appears in 6.7% of hub papers (1/15); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
Apr 1, 2026 · Citations: 0 · Score: 8.0

Metrics: Cost, Inference cost · Eval: Automatic Metrics
HyperMem: Hypergraph Memory for Long-Term Conversations
Apr 9, 2026 · Citations: 0 · Score: 7.5

Metrics: Accuracy, Coherence · Eval: Llm As Judge, Automatic Metrics
Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Jan 29, 2026 · Citations: 0 · Score: 7.5

Metrics: Pass@1, Cost · Eval: Simulation Env
PLOT: Enhancing Preference Learning via Optimal Transport
Apr 2, 2026 · Citations: 0 · Score: 7.5

Metrics: Coherence · Eval: Automatic Metrics
BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents
Mar 25, 2026 · Citations: 0 · Score: 7.5

Metrics: Accuracy, Coherence · Eval: Automatic Metrics
QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate
Mar 12, 2026 · Citations: 0 · Score: 7.5

Metrics: Coherence · Eval: Automatic Metrics

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper	Metrics	Benchmarks	Eval Modes	Quality Controls
$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution Apr 1, 2026	Cost, Inference cost	Yc Bench	Automatic Metrics	Not reported
HyperMem: Hypergraph Memory for Long-Term Conversations Apr 9, 2026	Accuracy, Coherence	Not reported	Llm As Judge, Automatic Metrics	Not reported
Embodied Task Planning via Graph-Informed Action Generation with Large Language Model Jan 29, 2026	Pass@1, Cost	ALFWorld	Simulation Env	Not reported
PLOT: Enhancing Preference Learning via Optimal Transport Apr 2, 2026	Coherence	Not reported	Automatic Metrics	Not reported
BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents Mar 25, 2026	Accuracy, Coherence	Not reported	Automatic Metrics	Not reported
QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate Mar 12, 2026	Coherence	Understanding Retrieval	Automatic Metrics	Not reported
VRM: Teaching Reward Models to Understand Authentic Human Preferences Mar 5, 2026	Coherence	Not reported	Human Eval	Not reported
The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration Oct 30, 2025	Accuracy, Coherence	Not reported	Automatic Metrics	Not reported
Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models Mar 23, 2026	Coherence	Not reported	Llm As Judge	Not reported
Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation Apr 8, 2026	Coherence	Not reported	Not reported	Not reported

How To Use This Page

Checklist

Moderate: Papers with explicit human feedback

Coverage is usable but incomplete (40% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (26.7% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (100% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (0% vs 35% target).
Strong: Papers with known annotation unit

Coverage is strong (46.7% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
Agentic evaluation appears in 60% of papers.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (ALFWorld vs MLE-Bench) before comparing methods.
Track metric sensitivity by reporting both coherence and accuracy.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: ALFWorld Metric Slice: coherence Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Coverage Snapshot

Top Metrics

Coherence (15)
Accuracy (3)
Cost (2)
Inference cost (1)

Evaluation Modes

Automatic Metrics (9)
Llm As Judge (3)
Simulation Env (3)
Human Eval (1)

Top Benchmarks

ALFWorld (1)
MLE Bench (1)
Understanding Retrieval (1)
Yc Bench (1)

Agentic Mix

Long Horizon (6)
Multi Agent (3)

Top Papers Reporting This Metric

HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang · Apr 9, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics General

Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
VRM: Teaching Reward Models to Understand Authentic Human Preferences
Biao Liu, Ning Xu, Junming Yang, Hao Xu, Xin Geng · Mar 5, 2026 · Citations: 0

Human Eval General

Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on…
The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Kotaro Furuya, Yuichi Kitagawa · Oct 30, 2025 · Citations: 0

Automatic Metrics General

While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition.
Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026 · Citations: 0

Simulation Env General

We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture.
Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026 · Citations: 0

Simulation Env General

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni
Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models
Aryan Kasat, Smriti Singh, Aman Chadha, Vinija Jain · Mar 23, 2026 · Citations: 0

Llm As Judge General

Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and…
PLOT: Enhancing Preference Learning via Optimal Transport
Liang Zhu, Yuelin Bai, Xiankun Ren, Jiaxi Yang, Lei Zhang · Apr 2, 2026 · Citations: 0

Automatic Metrics General

Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global…
BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents
Praveen Kumar Myakala, Manan Agrawal, Rahul Manche · Mar 25, 2026 · Citations: 0

Automatic Metrics General

LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved.
StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models
Zehao Chen, Rong Pan, Haoran Li · Oct 13, 2025 · Citations: 0

Simulation Env General

Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment.
$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi · Apr 1, 2026 · Citations: 0

Automatic Metrics General

As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound.
QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate
Jihao Zhao, Daixuan Li, Pengfei Li, Shuaishuai Zu, Biao Qin · Mar 12, 2026 · Citations: 0

Automatic Metrics General

Drawing inspiration from Hal Gregersen's "Questions Are the Answer" theory, we design a multi-agent debate framework comprising four specialized components: a question outline generator, text segmenter, integrity reviewer, and knowledge…
Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation
Zhiyu Cao, Peifeng Li, Qiaoming Zhu · Apr 8, 2026 · Citations: 0

General

Specifically, DRCR employs two complementary feedback signals, discourse coherence and response quality, to construct preference data for both context rewriting and response generation.
Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement
Dongxu Zhang, Hongqiang Lin, Yiding Sun, Pengyu Wang, Qirui Wang · Mar 9, 2026 · Citations: 0

Automatic Metrics General

To address this, we propose CoFiCot, a coarse-to-fine adaptive framework that dynamically tailors inference strategies to problem difficulty.
LayerT2V: A Unified Multi-Layer Video Generation Framework
Guangzhao Li, Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo · Aug 6, 2025 · Citations: 0

Automatic Metrics General

Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows.
Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning
Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy · Feb 24, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics General

Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70\% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning.