
Metric Hub

Coherence Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This hub groups 21 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequently cited benchmark: Retrieval. Common metric signal: coherence. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 21 · Last published: Feb 26, 2026

Research Narrative


This page tracks 21 papers for Coherence Metric Papers. Dominant protocol signals include automatic metrics, simulation environments, and LLM-as-judge, with frequent benchmark focus on Retrieval and LongBench and metric focus on coherence and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 42.9% of hub papers (9/21); use this cohort for benchmark-matched comparisons.
  • LongBench appears in 9.5% of hub papers (2/21); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • coherence is reported in 100% of hub papers (21/21); compare with a secondary metric before ranking methods.
  • accuracy is reported in 14.3% of hub papers (3/21); compare with a secondary metric before ranking methods.
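
The coverage figures above follow directly from the paper counts. A minimal sketch of that arithmetic, assuming the hub rounds to one decimal place (the function and variable names here are illustrative, not the hub's actual code):

```python
# Hypothetical sketch: how the hub's coverage percentages are derived.
# Counts are taken from this page (e.g. coherence 21/21, accuracy 3/21).

def coverage(mentions: int, total: int) -> float:
    """Percentage of hub papers reporting a signal, rounded to one decimal."""
    return round(100 * mentions / total, 1)

metric_counts = {"coherence": 21, "accuracy": 3}
total_papers = 21

for metric, n in metric_counts.items():
    print(f"{metric}: {coverage(n, total_papers)}% ({n}/{total_papers})")
```

The same formula reproduces the benchmark figures as well, e.g. Retrieval at 9/21 gives 42.9%.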

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (14.3% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (61.9% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (4.8% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (19% vs 35% target).
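
The checklist above reduces to a comparison of each coverage figure against its target. A minimal sketch of that logic, with numbers copied from the bullets (the thresholds and item names mirror this page; the code itself is an assumption):

```python
# Hypothetical sketch of the checklist logic: label each item a strength
# or a replication risk by comparing coverage against its target.
# Figures are taken from the Researcher Checklist on this page.

checklist = {
    "explicit human feedback": (14.3, 45),
    "quality controls": (0.0, 30),
    "benchmarks/datasets named": (61.9, 35),
    "evaluation metrics named": (100.0, 35),
    "known rater population": (4.8, 35),
    "known annotation unit": (19.0, 35),
}

def status(coverage: float, target: float) -> str:
    """Coverage at or above target counts as a strength; below is a risk."""
    return "strength" if coverage >= target else "replication risk"

for item, (cov, tgt) in checklist.items():
    print(f"{item}: {status(cov, tgt)} ({cov}% vs {tgt}% target)")
```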


Suggested Reading Order

  1. Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  3. Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  4. KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

    Adds automatic metrics for broader coverage within this hub.

  5. Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs

    Adds automatic metrics for broader coverage within this hub.

  6. Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

    Adds automatic metrics for broader coverage within this hub.

  7. Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions

    Adds automatic metrics for broader coverage within this hub.

  8. AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (4.8% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

llm_as_judge vs automatic_metrics

both=1, left_only=0, right_only=14

1 paper uses both LLM-as-judge and automatic metrics.

automatic_metrics vs simulation_env

both=0, left_only=15, right_only=6

No papers use both automatic metrics and simulation environments.

simulation_env vs llm_as_judge

both=0, left_only=6, right_only=1

No papers use both simulation environments and LLM-as-judge.
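
The both/left_only/right_only counts above are plain set intersections over paper memberships. A minimal sketch, assuming each protocol maps to a set of paper IDs (the IDs below are illustrative placeholders chosen to match the automatic_metrics vs simulation_env comparison, not the hub's real identifiers):

```python
# Hypothetical sketch of the protocol-overlap counts (both / left_only /
# right_only), modeled as set operations over paper IDs.

def overlap(left: set, right: set) -> dict:
    """Count papers in both sets, only the left set, and only the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Placeholder IDs chosen so the counts match one comparison on this page:
# automatic_metrics vs simulation_env -> both=0, left_only=15, right_only=6.
automatic_metrics = {f"paper_{i}" for i in range(15)}
simulation_env = {f"paper_{i}" for i in range(15, 21)}

print(overlap(automatic_metrics, simulation_env))
```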

Benchmark Brief

LongBench

Coverage: 2 papers (9.5%)

2 papers (9.5%) mention LongBench.

Examples: Reinforced Fast Weights with Next-Sequence Prediction, Document Reconstruction Unlocks Scalable Long-Context RLVR

Benchmark Brief

ALFWorld

Coverage: 1 paper (4.8%)

1 paper (4.8%) mentions ALFWorld.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Top Papers Reporting This Metric
