HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W28

Updated from current HFEPX corpus (Mar 8, 2026). 11 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 8, 2026). 11 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequently cited benchmark: Clembench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Jul 11, 2025.

Papers: 11 Last published: Jul 11, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

11 / 11 papers are not low-signal flagged.

Benchmark Anchors

18.2%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

27.3%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Time Slice Matters

18.2% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 18.2% of papers in this hub.
Clembench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly mixed annotation units; use this to scope replication staffing.
Stratify by benchmark (Clembench vs HotpotQA) before comparing methods.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Jul 10, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Accuracy
Anthropomimetic Uncertainty: What Verbalized Uncertainty in Language Models is Missing
Jul 11, 2025 · Citations: 0 · Score: 3.5

Eval: Automatic Metrics · Metrics: Accuracy
A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
Jul 11, 2025 · Citations: 0 · Score: 3.5

Eval: Not reported · Metrics: Not reported
Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators
Jul 8, 2025 · Citations: 0 · Score: 3.5

Eval: Simulation Env · Metrics: Cost
From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations
Jul 7, 2025 · Citations: 0 · Score: 2.0

Eval: Not reported · Metrics: Not reported
Mechanistic Indicators of Understanding in Large Language Models
Jul 7, 2025 · Citations: 0 · Score: 1.0

Eval: Not reported · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Jul 10, 2025	Automatic Metrics	Treebench	Accuracy	Not reported
Anthropomimetic Uncertainty: What Verbalized Uncertainty in Language Models is Missing Jul 11, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench Jul 11, 2025	Not reported	LMSYS Chatbot Arena, Clembench	Not reported	Not reported
Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators Jul 8, 2025	Simulation Env	Not reported	Cost	Not reported
From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations Jul 7, 2025	Not reported	Not reported	Not reported	Not reported
Mechanistic Indicators of Understanding in Large Language Models Jul 7, 2025	Not reported	Not reported	Not reported	Not reported
The Generalization Ridge: Information Flow in Natural Language Generation Jul 7, 2025	Not reported	Not reported	Not reported	Not reported
From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems Jul 10, 2025	Not reported	Not reported	Not reported	Not reported
FrugalRAG: Less is More in RL Finetuning for Multi-Hop Question Answering Jul 10, 2025	Not reported	Not reported	Not reported	Not reported
SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs Jul 10, 2025	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (18.2% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (27.3% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (27.3% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (9.1% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (0% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9.1% coverage).
Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

Stratify by benchmark (Clembench vs HotpotQA) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Benchmark Slice: Clembench Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9.1% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (2)
Simulation Env (1)

Top Metrics

Accuracy (3)
Cost (1)
Relevance (1)

Top Benchmarks

Clembench (1)
HotpotQA (1)
Lm Arena (1)
LMSYS Chatbot Arena (1)

Quality Controls

Papers In This Archive Slice

Anthropomimetic Uncertainty: What Verbalized Uncertainty in Language Models is Missing
Dennis Ulmer, Alexandra Lorson, Ivan Titov, Christian Hardmeier · Jul 11, 2025 · Citations: 0

Human users increasingly communicate with large language models (LLMs), but LLMs suffer from frequent overconfidence in their output, even when its accuracy is questionable, which undermines their trustworthiness and perceived legitimacy.
A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler · Jul 11, 2025 · Citations: 0

Pairwise Preference

There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation.
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang · Jul 10, 2025 · Citations: 0

To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box…
From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
Youngjoon Jang, Seongtae Hong, Junyoung Son, Sungjin Park, Chanjun Park · Jul 10, 2025 · Citations: 0
FrugalRAG: Less is More in RL Finetuning for Multi-Hop Question Answering
Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma · Jul 10, 2025 · Citations: 0
SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs
Siting Wang, Minnan Pei, Luoyang Sun, Cheng Deng, Yuchen Li · Jul 10, 2025 · Citations: 0
Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators
Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo · Jul 8, 2025 · Citations: 0

Traditionally, this requires costly, large-scale human data collection.
Mechanistic Indicators of Understanding in Large Language Models
Pierre Beckmann, Matthieu Queloz · Jul 7, 2025 · Citations: 0

However, these also diverge from human cognition in their parallel exploitation of heterogeneous mechanisms.
The Generalization Ridge: Information Flow in Natural Language Generation
Ruidi Chang, Chunyuan Deng, Hanjie Chen · Jul 7, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations
Pulkit Bansal, Raghvendra Kumar, Shakti Singh, Sriparna Saha, Adam Jatowt · Jul 7, 2025 · Citations: 0

Pairwise Preference

To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning.
Agentic Vehicles for Human-Centered Mobility
Jiangbo Yu, Raphael Frank, Luis Miranda-Moreno, Sasan Jafarnejad, Jonatas Augusto Manzolli · Jul 7, 2025 · Citations: 0

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote