
Weekly Archive

HFEPX Weekly Archive: 2025-W41

Updated from the current HFEPX corpus (Feb 27, 2026). This weekly page groups 20 papers. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: AlpacaEval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Oct 12, 2025.

Papers: 20 · Last published: Oct 12, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 20 papers for HFEPX Weekly Archive: 2025-W41. Dominant protocol signals include automatic metrics, human evaluation, and simulation environments, with frequent benchmark focus on AlpacaEval and Arena-Hard, and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • AlpacaEval appears in 5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.
  • Arena-Hard appears in 5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 25% of hub papers (5/20); compare with a secondary metric before ranking methods.
  • cost is reported in 10% of hub papers (2/20); compare with a secondary metric before ranking methods.
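The "compare with a secondary metric before ranking methods" guidance can be sketched as a two-key sort; the method names and scores below are hypothetical, not drawn from the hub papers:

```python
# Hypothetical per-method scores; names and values are illustrative only.
methods = {
    "method_a": {"accuracy": 0.82, "cost": 1.20},
    "method_b": {"accuracy": 0.82, "cost": 0.90},
    "method_c": {"accuracy": 0.78, "cost": 0.50},
}

# Rank primarily by accuracy (descending) and break ties by cost
# (ascending), so methods with equal accuracy are ordered by the
# secondary metric instead of arbitrarily.
ranked = sorted(
    methods,
    key=lambda m: (-methods[m]["accuracy"], methods[m]["cost"]),
)
print(ranked)  # ['method_b', 'method_a', 'method_c']
```

Here method_a and method_b tie on accuracy, so the cheaper method_b ranks first; a single-metric ranking would have ordered them arbitrarily.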

Researcher Checklist

  • Close the gap on papers with explicit human feedback: coverage is a replication risk (15% vs 45% target).
  • Close the gap on papers reporting quality controls: coverage is a replication risk (5% vs 30% target).
  • Tighten coverage on papers naming benchmarks/datasets: coverage is usable but incomplete (25% vs 35% target).
  • Maintain strength on papers naming evaluation metrics: coverage is strong (50% vs 35% target).
  • Close the gap on papers with known rater population: coverage is a replication risk (10% vs 35% target).
  • Close the gap on papers with known annotation unit: coverage is a replication risk (10% vs 35% target).
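As a minimal sketch, the coverage labels in the checklist can be reproduced from the stated (coverage, target) pairs; the 10-point "usable but incomplete" band is an assumption inferred from the labels above, not a rule stated by the hub:

```python
# (actual coverage, target) pairs taken from the checklist above.
coverage = {
    "explicit human feedback": (0.15, 0.45),
    "quality controls": (0.05, 0.30),
    "benchmarks/datasets named": (0.25, 0.35),
    "evaluation metrics named": (0.50, 0.35),
    "rater population known": (0.10, 0.35),
    "annotation unit known": (0.10, 0.35),
}

statuses = {}
for dim, (actual, target) in coverage.items():
    if actual >= target:
        status = "strong"
    elif actual >= target - 0.10:  # assumed 10-point tolerance band
        status = "usable but incomplete"
    else:
        status = "replication risk"
    statuses[dim] = status
    print(f"{dim}: {actual:.0%} vs {target:.0%} target -> {status}")
```

Under that assumed band, the script reproduces all six checklist labels, which suggests the hub applies a similar threshold rule.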

Suggested Reading Order

  1. FML-bench: Benchmarking Machine Learning Agents for Scientific Research

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Mapping Semantic & Syntactic Relationships with Geometric Rotation

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  3. The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  4. Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

    Include a human-eval paper to anchor calibration against automated judge settings.

  5. Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery

    Adds automatic metrics for broader coverage within this hub.

  6. Verifying Chain-of-Thought Reasoning via Its Computational Graph

    Adds automatic metrics for broader coverage within this hub.

  7. Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

    Adds automatic metrics for broader coverage within this hub.

  8. FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Only 5% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (10% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

human_eval vs automatic_metrics

both=1, left_only=0, right_only=19

1 paper uses both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=1, left_only=19, right_only=0

1 paper uses both Automatic Metrics and Simulation Env.

human_eval vs simulation_env

both=0, left_only=1, right_only=1

No papers use both Human Eval and Simulation Env.
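The both/left_only/right_only tallies above are plain set-overlap counts. A minimal sketch, using hypothetical paper IDs chosen to reproduce the human_eval vs automatic_metrics figures:

```python
# Hypothetical paper IDs; only the overlap arithmetic mirrors the page.
human_eval = {"paper_04"}  # the one paper reporting human eval
automatic_metrics = {f"paper_{i:02d}" for i in range(1, 21)}  # all 20

both = human_eval & automatic_metrics        # papers in both cohorts
left_only = human_eval - automatic_metrics   # human eval only
right_only = automatic_metrics - human_eval  # automatic metrics only

print(len(both), len(left_only), len(right_only))  # 1 0 19
```

The same three set operations produce each comparison on this page; only the membership sets change.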

Benchmark Brief

AlpacaEval

Coverage: 1 paper (5%)

1 paper (5%) mentions AlpacaEval.

Examples: Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty

Benchmark Brief

Arena-Hard

Coverage: 1 paper (5%)

1 paper (5%) mentions Arena-Hard.

Examples: Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty

Benchmark Brief

FML-bench

Coverage: 1 paper (5%)

1 paper (5%) mentions FML-bench.

Examples: FML-bench: Benchmarking Machine Learning Agents for Scientific Research

Metric Brief

calibration

Coverage: 1 paper (5%)

1 paper (5%) mentions calibration.

Examples: Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery

Papers Published This Week
