Metric Hub

Coherence Metric Papers

Updated from current HFEPX corpus (Feb 27, 2026). 21 papers are grouped on this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequently cited benchmark: Retrieval. Common metric signal: coherence. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 21 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 21 papers that report a coherence metric. Dominant protocol signals are automatic metrics, simulation environments, and LLM-as-judge, with frequent benchmark focus on Retrieval and LongBench and metric focus on coherence and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

14.3% of papers report explicit human-feedback signals, led by critique/edit feedback.

Evidence: From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design; Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

Automatic metrics appear in 71.4% of papers in this hub.

Evidence: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems; Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning; KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge; Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering; Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions; Reinforced Fast Weights with Next-Sequence Prediction; Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning; KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

Where reported, raters are mostly domain experts and annotation is commonly freeform, but only 4.8% of papers state the rater population and 19% the annotation unit; use this to scope replication staffing.

Evidence: Document Reconstruction Unlocks Scalable Long-Context RLVR; Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

Pair this hub with a human-evaluation-heavy hub to validate judge-model calibration.

Evidence: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning; KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

Benchmark Interpretation

Retrieval appears in 42.9% of hub papers (9/21); use this cohort for benchmark-matched comparisons.
LongBench appears in 9.5% of hub papers (2/21); use this cohort for benchmark-matched comparisons.

Metric Interpretation

coherence is reported in 100% of hub papers (21/21); compare with a secondary metric before ranking methods.
accuracy is reported in 14.3% of hub papers (3/21); compare with a secondary metric before ranking methods.
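
The coverage figures in the Benchmark Interpretation and Metric Interpretation lists can be reproduced by dividing each paper count by the hub size and rounding to one decimal place. The sketch below is illustrative only: the function name `coverage_percent` and the constant `HUB_SIZE` are hypothetical, and the rounding and display convention is inferred from the numbers shown on this page, not taken from the HFEPX pipeline.

```python
def coverage_percent(count: int, total: int) -> str:
    """Return count/total as a percentage string rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    pct = round(100 * count / total, 1)
    text = f"{pct:.1f}"
    # Drop a trailing ".0" so whole numbers read "100%" rather than "100.0%".
    if text.endswith(".0"):
        text = text[:-2]
    return f"{text}%"


HUB_SIZE = 21  # number of papers in this hub, as stated on the page

# Counts taken from the interpretation lists above.
for label, count in [("Retrieval", 9), ("LongBench", 2),
                     ("coherence", 21), ("accuracy", 3)]:
    print(f"{label}: {coverage_percent(count, HUB_SIZE)} ({count}/{HUB_SIZE})")
```

With only 21 papers in the hub, a single paper moves coverage by roughly 4.8 percentage points, so small differences between percentages should be read with care.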

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (14.3% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (61.9% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (4.8% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (19% vs 35% target).
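
The checklist labels above are consistent with a simple threshold comparison between observed coverage and a target. The page does not state the rule it uses, so the sketch below assumes one: coverage at or above target counts as "strong", anything below counts as "replication risk". The names `ChecklistItem` and `status` are hypothetical; the coverage and target values are copied from the checklist.

```python
from typing import NamedTuple


class ChecklistItem(NamedTuple):
    label: str
    coverage: float  # percent of hub papers meeting the criterion
    target: float    # percent target listed in the checklist


def status(item: ChecklistItem) -> str:
    # Assumed rule (not documented on the page): meeting the target is "strong".
    return "strong" if item.coverage >= item.target else "replication risk"


ITEMS = [
    ChecklistItem("explicit human feedback", 14.3, 45),
    ChecklistItem("quality controls", 0, 30),
    ChecklistItem("benchmarks/datasets named", 61.9, 35),
    ChecklistItem("evaluation metrics named", 100, 35),
    ChecklistItem("known rater population", 4.8, 35),
    ChecklistItem("known annotation unit", 19, 35),
]

for item in ITEMS:
    gap = item.coverage - item.target  # negative means below target
    print(f"{item.label}: {status(item)} ({gap:+.1f} points vs target)")
```

Sorting the items by `gap` would give one possible priority order for closing coverage gaps, with quality controls and human feedback furthest below target.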

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (4.8% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

LLM-as-Judge Protocols - Finds judge-based evaluation setups to compare calibration and drift risks.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: coherence - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

llm_as_judge vs automatic_metrics

both=1, left_only=0, right_only=14

1 paper uses both LLM-as-judge and automatic metrics.

automatic_metrics vs simulation_env

both=0, left_only=15, right_only=6

No papers use both automatic metrics and a simulation environment.

simulation_env vs llm_as_judge

both=0, left_only=6, right_only=1

No papers use both a simulation environment and LLM-as-judge.
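
The both / left_only / right_only counts in the three comparisons above are ordinary set intersections and differences over the papers tagged with each evaluation mode. The sketch below reproduces them under stated assumptions: the integer paper IDs are placeholders, not real identifiers, and the helper name `overlap` is hypothetical. Only the set sizes (15 automatic-metrics papers, 6 simulation-environment papers, 1 LLM-as-judge paper that also uses automatic metrics) come from this page.

```python
def overlap(left: set, right: set) -> dict:
    """Count papers in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Placeholder paper IDs 0..20; only the group sizes reflect the page.
automatic_metrics = set(range(15))   # 15 papers
llm_as_judge = {0}                   # 1 paper, also tagged automatic metrics
simulation_env = set(range(15, 21))  # 6 papers

print("llm_as_judge vs automatic_metrics:", overlap(llm_as_judge, automatic_metrics))
print("automatic_metrics vs simulation_env:", overlap(automatic_metrics, simulation_env))
print("simulation_env vs llm_as_judge:", overlap(simulation_env, llm_as_judge))
```

Because the automatic-metrics and simulation-environment groups are disjoint and together cover all 21 papers, any judge-vs-automatic comparison in this hub rests on a single paper.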

Benchmark Brief

Retrieval

Coverage: 9 papers (42.9%)

Examples: Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs; Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering; Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions

Benchmark Brief

LongBench

Coverage: 2 papers (9.5%)

Examples: Reinforced Fast Weights with Next-Sequence Prediction; Document Reconstruction Unlocks Scalable Long-Context RLVR

Benchmark Brief

ALFWorld

Coverage: 1 paper (4.8%)

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Metric Brief

coherence

Coverage: 21 papers (100%)

Examples: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

Metric Brief

accuracy

Coverage: 3 papers (14.3%)

Examples: KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge; KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification; Structure-Augmented Reasoning Generation

Metric Brief

cost

Coverage: 2 papers (9.5%)

Examples: Protecting Language Models Against Unauthorized Distillation through Trace Rewriting; Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems; ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao · Feb 26, 2026

Automatic Metrics General

Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems.
ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026

Simulation Env Math

We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning
Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy · Feb 24, 2026

LLM-as-Judge Automatic Metrics General

Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning.
KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge
Alex Robertson, Huizhi Liang, Mahbub Gani, Rohit Kumar, Srijith Rajamohan · Feb 23, 2026

Automatic Metrics General

Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations.
Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs
Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide · Feb 22, 2026

Automatic Metrics Coding

Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-s...
Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Amine Kobeissi, Philippe Langlais · Feb 20, 2026

Automatic Metrics Coding

Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings.
Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions
Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini · Feb 20, 2026

Automatic Metrics General

Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity.
AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue
Adib Sakhawat, Fardeen Sadab, Rakin Shahriar · Feb 19, 2026

Automatic Metrics General

Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions.
Reinforced Fast Weights with Next-Sequence Prediction
Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky · Feb 18, 2026

Automatic Metrics General

Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length.
Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik · Feb 16, 2026

Automatic Metrics General

Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models.
From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026

Simulation Env Coding

We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design.
Document Reconstruction Unlocks Scalable Long-Context RLVR
Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin · Feb 9, 2026

Automatic Metrics Coding

However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.
How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?
Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das · Feb 6, 2026

Simulation Env General

A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats.
Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026

Simulation Env Coding

While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026

Simulation Env General

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni...
KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh · Dec 9, 2025

Automatic Metrics Medicine Coding

Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and managemen...
Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces
Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025

Automatic Metrics Coding

On the Episodic Memory Benchmark (EpBench) \cite{huet_episodic_2025}, comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG-based baselines by up to 20%.
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani · Oct 31, 2025

Automatic Metrics General

Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coh...
Structure-Augmented Reasoning Generation
Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han · Jun 10, 2025

Automatic Metrics General

Extensive experiments on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine demonstrate that SARG significantly outperforms state-of-the-art flat-context RAG baselines in both factual accuracy and reasoning...
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao · Apr 7, 2025

Automatic Metrics Math

We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies (in-cont...
EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski · Mar 24, 2025

Simulation Env General

We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs.
