Metric Hub

Coherence + General Metric Papers

Updated from current HFEPX corpus (Feb 27, 2026). 12 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Common annotation unit: Trajectory. Frequently cited benchmark: Retrieval. Common metric signal: coherence. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 12 Last published: Feb 26, 2026 Global RSS

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 12 papers for Coherence + General Metric Papers. Dominant protocol signals include automatic metrics, simulation environments, LLM-as-judge, with frequent benchmark focus on Retrieval, Kghalubench and metric focus on coherence, accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

automatic metrics appears in 75% of papers in this hub.

Evidence: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems , Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning , KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge , Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions , Reinforced Fast Weights with Next-Sequence Prediction , Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs , Structure-Augmented Reasoning Generation
long-horizon tasks appears in 8.3% of papers, indicating agentic evaluation demand.

Evidence: Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering , Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems , Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning , KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems , Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning , KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge , Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions
Rater context is mostly unspecified rater pools, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Evidence: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems , Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning , KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge , Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Evidence: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems , Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning , KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge , Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions

Benchmark Interpretation

Retrieval appears in 33.3% of hub papers (4/12); use this cohort for benchmark-matched comparisons.
Kghalubench appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.

Metric Interpretation

coherence is reported in 100% of hub papers (12/12); compare with a secondary metric before ranking methods.
accuracy is reported in 16.7% of hub papers (2/12); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (0% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (58.3% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (0% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (8.3% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (0% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).

Papers naming benchmarks/datasets

Coverage is strong (58.3% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (100% vs 35% target).

Papers with known rater population

Coverage is a replication risk (0% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (8.3% vs 35% target).

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

LLM-as-Judge Protocols - Finds judge-based evaluation setups to compare calibration and drift risks.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: coherence - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

llm_as_judge vs automatic_metrics

both=1, left_only=0, right_only=8

1 papers use both Llm As Judge and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=9, right_only=3

0 papers use both Automatic Metrics and Simulation Env.

simulation_env vs llm_as_judge

both=0, left_only=3, right_only=1

0 papers use both Simulation Env and Llm As Judge.

Benchmark Brief

Retrieval

Coverage: 4 papers (33.3%)

4 papers (33.3%) mention Retrieval.

Examples: Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions , Reinforced Fast Weights with Next-Sequence Prediction , Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Benchmark Brief

Kghalubench

Coverage: 1 papers (8.3%)

1 papers (8.3%) mention Kghalubench.

Examples: KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

Benchmark Brief

LongBench

Coverage: 1 papers (8.3%)

1 papers (8.3%) mention LongBench.

Examples: Reinforced Fast Weights with Next-Sequence Prediction

Metric Brief

coherence

Coverage: 12 papers (100%)

12 papers (100%) mention coherence.

Examples: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems , Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning , KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

Metric Brief

accuracy

Coverage: 2 papers (16.7%)

2 papers (16.7%) mention accuracy.

Examples: KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge , Structure-Augmented Reasoning Generation

Metric Brief

context length

Coverage: 1 papers (8.3%)

1 papers (8.3%) mention context length.

Examples: Reinforced Fast Weights with Next-Sequence Prediction

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems , Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning , KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao · Feb 26, 2026

Automatic Metrics General

Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems.
Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning
Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy · Feb 24, 2026

Llm As JudgeAutomatic Metrics General

Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves $>70\%$ win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning.
KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge
Alex Robertson, Huizhi Liang, Mahbub Gani, Rohit Kumar, Srijith Rajamohan · Feb 23, 2026

Automatic Metrics General

Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations.
Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions
Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini · Feb 20, 2026

Automatic Metrics General

Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity.
AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue
Adib Sakhawat, Fardeen Sadab, Rakin Shahriar · Feb 19, 2026

Automatic Metrics General

Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions.
Reinforced Fast Weights with Next-Sequence Prediction
Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky · Feb 18, 2026

Automatic Metrics General

Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length.
Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik · Feb 16, 2026

Automatic Metrics General

Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models.
How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?
Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das · Feb 6, 2026

Simulation Env General

A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats.
Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026

Simulation Env General

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani · Oct 31, 2025

Automatic Metrics General

Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coh
Structure-Augmented Reasoning Generation
Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han · Jun 10, 2025

Automatic Metrics General

Extensive experiments on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine demonstrate that SARG significantly outperforms state-of-the-art flat-context RAG baselines in both factual accuracy and reasoning
EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski · Mar 24, 2025

Simulation Env General

We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs.

Coherence + General Metric Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers Reporting This Metric

Other Metric Hubs