Metric Hub

Faithfulness + Automatic Metrics Metric Papers

Updated from current HFEPX corpus (Feb 27, 2026). 10 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequently cited benchmark: Retrieval. Common metric signal: faithfulness. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 10 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 10 papers that report faithfulness and use automatic metrics. The dominant protocol signals are automatic metrics and simulation environments; the benchmarks named most often are Retrieval and BIG-Bench, and the most common metrics are faithfulness and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Automatic metrics appear in 100% of papers in this hub (10/10).

Evidence: Probing for Knowledge Attribution in Large Language Models, Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA, Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models, Causal Decoding for Hallucination-Resistant Multimodal Large Language Models

Retrieval is the most frequently named benchmark (2/10 papers) and the main anchor for cross-paper comparisons on this page.

Evidence: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA, MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning, Probing for Knowledge Attribution in Large Language Models, Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models

Protocol Takeaways

Quality-control reporting is sparse in this slice (0% coverage); when extending the set, prioritize papers with explicit calibration or adjudication steps.

Evidence: Probing for Knowledge Attribution in Large Language Models, Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA, Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models, Causal Decoding for Hallucination-Resistant Multimodal Large Language Models

Where a rater population is reported (10% of papers), it is mostly domain experts, and annotation units are commonly mixed; use this to scope replication staffing.

Evidence: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA, Probing for Knowledge Attribution in Large Language Models, Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models, Causal Decoding for Hallucination-Resistant Multimodal Large Language Models

Stratify by benchmark (Retrieval vs BIG-Bench) before comparing methods.

Evidence: Probing for Knowledge Attribution in Large Language Models, Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA, Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models, Causal Decoding for Hallucination-Resistant Multimodal Large Language Models
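
The advice to stratify by benchmark before comparing methods can be sketched in a few lines of Python. This is a minimal illustration, not the hub's own tooling: the method names and faithfulness scores are hypothetical placeholders invented for the example, and only the benchmark labels (Retrieval, BIG-Bench) come from this page.

```python
from collections import defaultdict

# Hypothetical records: method names and scores are invented for illustration
# and do not come from any paper in this hub.
results = [
    {"method": "method_a", "benchmark": "Retrieval", "faithfulness": 0.81},
    {"method": "method_b", "benchmark": "Retrieval", "faithfulness": 0.77},
    {"method": "method_c", "benchmark": "BIG-Bench", "faithfulness": 0.64},
]


def stratify_by_benchmark(rows):
    """Group result rows by benchmark so methods are only compared within a stratum."""
    strata = defaultdict(list)
    for row in rows:
        strata[row["benchmark"]].append(row)
    return dict(strata)


def rank_within_strata(rows, metric="faithfulness"):
    """Return {benchmark: [method names sorted by the metric, best first]}."""
    return {
        benchmark: [
            r["method"]
            for r in sorted(group, key=lambda r: r[metric], reverse=True)
        ]
        for benchmark, group in stratify_by_benchmark(rows).items()
    }


rankings = rank_within_strata(results)
for benchmark, methods in rankings.items():
    print(benchmark, methods)
```

Ranking inside each stratum avoids comparing a score obtained on Retrieval with one obtained on BIG-Bench, which is the failure mode the takeaway warns about.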

Benchmark Interpretation

Retrieval appears in 20% of hub papers (2/10); use this cohort for benchmark-matched comparisons.
BIG-Bench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Faithfulness is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
Accuracy is reported in 50% of hub papers (5/10); compare with a secondary metric before ranking methods.

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (0% vs 45% target).
Close the gap on papers reporting quality controls: coverage is a replication risk (0% vs 30% target).
Maintain strength on papers naming benchmarks/datasets: coverage is strong (70% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (100% vs 35% target).
Close the gap on papers with a known rater population: coverage is a replication risk (10% vs 35% target).
Close the gap on papers with a known annotation unit: coverage is a replication risk (0% vs 35% target).
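
The checklist compares each coverage figure against a target. A minimal sketch of that comparison follows, using the percentages listed above. The threshold rule (coverage at or above target counts as strong, anything below is a replication risk) is an assumption inferred from the six items, and the function names are hypothetical.

```python
# Coverage and target percentages copied from the checklist above,
# stored as (coverage_pct, target_pct).
CHECKLIST = {
    "explicit human feedback": (0, 45),
    "quality controls": (0, 30),
    "benchmarks/datasets named": (70, 35),
    "evaluation metrics named": (100, 35),
    "known rater population": (10, 35),
    "known annotation unit": (0, 35),
}


def classify(coverage_pct, target_pct):
    """Assumed rule: meeting the target is 'strong', falling short is a risk."""
    return "strong" if coverage_pct >= target_pct else "replication risk"


def gaps(checklist):
    """Return (item, shortfall in percentage points) for items below target, largest first."""
    short = [
        (name, target - coverage)
        for name, (coverage, target) in checklist.items()
        if coverage < target
    ]
    return sorted(short, key=lambda pair: pair[1], reverse=True)


for name, (coverage, target) in CHECKLIST.items():
    print(f"{name}: {coverage}% vs {target}% target -> {classify(coverage, target)}")

print(gaps(CHECKLIST))
```

Sorting by shortfall puts explicit human feedback first (45 points below target), which suggests where additional papers would reduce replication risk the most.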

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: faithfulness - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=1, left_only=9, right_only=0

1 paper uses both Automatic Metrics and Simulation Env; the other 9 use Automatic Metrics only.
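
The both/left_only/right_only counts above are plain set arithmetic. The sketch below reproduces them under stated assumptions: the paper identifiers are hypothetical placeholders, and the function name is illustrative rather than part of any HFEPX tooling.

```python
def overlap_counts(left, right):
    """Count items in both sets, only in the left set, and only in the right set."""
    left, right = set(left), set(right)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical identifiers: all 10 hub papers use automatic metrics,
# and one of them also uses a simulation environment.
automatic_metrics = {f"paper_{i}" for i in range(1, 11)}
simulation_env = {"paper_5"}

counts = overlap_counts(automatic_metrics, simulation_env)
print(counts)  # {'both': 1, 'left_only': 9, 'right_only': 0}
```

Here "left" is automatic_metrics and "right" is simulation_env, matching the order in the heading.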

Benchmark Brief

Retrieval

Coverage: 2 papers (20%) mention Retrieval.

Examples: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA, MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning

Benchmark Brief

BIG-Bench

Coverage: 1 paper (10%) mentions BIG-Bench.

Examples: Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution

Benchmark Brief

DROP

Coverage: 1 paper (10%) mentions DROP.

Examples: Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models

Metric Brief

faithfulness

Coverage: 10 papers (100%) mention faithfulness.

Examples: Probing for Knowledge Attribution in Large Language Models, Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA, Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models

Metric Brief

accuracy

Coverage: 5 papers (50%) mention accuracy.

Examples: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA, Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models, Counterfactual Simulation Training for Chain-of-Thought Faithfulness

Metric Brief

f1

Coverage: 1 paper (10%) mentions f1.

Examples: Probing for Knowledge Attribution in Large Language Models

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Probing for Knowledge Attribution in Large Language Models, Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA, Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

Probing for Knowledge Attribution in Large Language Models
Ivo Brink, Alexander Boer, Dennis Ulmer · Feb 26, 2026

Automatic Metrics · General

Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retr

Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun · Feb 26, 2026

Automatic Metrics · Law

We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for

Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek · Feb 25, 2026

Automatic Metrics · General

Theory of Mind (ToM) refers to an agent's ability to model the internal states of others.

Causal Decoding for Hallucination-Resistant Multimodal Large Language Models
Shiwei Tan, Hengyi Wang, Weiyi Qin, Qi Xu, Zhigang Hua · Feb 24, 2026

Automatic Metrics · General

Across captioning and QA benchmarks, our framework substantially lowers object-hallucination rates and achieves state-of-the-art faithfulness without degrading overall output quality.

Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Peter Hase, Christopher Potts · Feb 24, 2026

Automatic Metrics, Simulation Env · Coding

Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output.

RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Yunseok Han, Yejoon Lee, Jaeyoung Do · Feb 19, 2026

Automatic Metrics · Math, Coding

To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions.

Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution
Nithin Sivakumaran, Shoubin Yu, Hyunji Lee, Yue Zhang, Ali Payani · Feb 18, 2026

Automatic Metrics · General

On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness -- hint attribution, early answering area over the curve (AOC), and mist

RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
Haofeng Wang, Yu Zhang · Nov 10, 2025

Automatic Metrics · Law

Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have shown impressive performance on various multimodal benchmarks.

MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning
Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan · Oct 15, 2025

Automatic Metrics · General

Comprehensive experiments on multiple temporal QA benchmarks show that MemoTime achieves overall state-of-the-art results, outperforming the strong baseline by up to 24.0%.

Towards Attributions of Input Variables in a Coalition
Xinhao Zheng, Huiqi Deng, Quanshi Zhang · Sep 23, 2023

Automatic Metrics · General

Experiments on synthetic data, NLP, image classification, and the game of Go validate our approach, demonstrating consistency with human intuition and practical applicability.
