HFEPX Monthly Archive: 2025-04

Updated from current HFEPX corpus (Feb 27, 2026). 14 papers are grouped in this monthly page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Apr 28, 2025.

Papers: 14 · Last published: Apr 28, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 14 papers for HFEPX Monthly Archive: 2025-04. Dominant protocol signals include automatic metrics, with frequent benchmark focus on Retrieval and MedQA, and metric focus on accuracy and precision. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

35.7% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition , Evaluating the Diversity and Quality of LLM Generated Content , Diffusion Generative Recommendation with Continuous Tokens , Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Automatic metrics appear in 100% of papers in this hub.

Evidence: A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage , Reshaping MOFs text mining with a dynamic multi-agents framework of large language model , Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition , How much does context affect the accuracy of AI health advice?
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition , Diffusion Generative Recommendation with Continuous Tokens , Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning , A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
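
The human-feedback share above can be cross-checked against the tag chips shown on the paper cards further down this page. The following is a minimal sketch, not the page's actual pipeline: the tag lists are copied from the cards, while the choice of which tags count as human-feedback signals (`HUMAN_FEEDBACK_TAGS`) and the helper name `tally_human_feedback` are assumptions made for illustration.

```python
from collections import Counter

# Tag chips copied from the paper cards on this page; untagged papers are omitted.
PAPER_TAGS = {
    "Reshaping MOFs text mining with a dynamic multi-agents framework of large language model": ["Multi Agent"],
    "Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition": ["Pairwise Preference", "Multi Agent"],
    "Evaluating the Diversity and Quality of LLM Generated Content": ["Pairwise Preference"],
    "Diffusion Generative Recommendation with Continuous Tokens": ["Pairwise Preference"],
    "Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models": ["Red Team"],
    "Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning": ["Pairwise Preference"],
}

# Assumption: these two tags are the ones treated as human-feedback signals.
HUMAN_FEEDBACK_TAGS = {"Pairwise Preference", "Red Team"}


def tally_human_feedback(paper_tags, feedback_tags):
    """Return (number of papers with any feedback tag, per-tag counts)."""
    tally = Counter()
    papers_with_feedback = 0
    for tags in paper_tags.values():
        hits = [tag for tag in tags if tag in feedback_tags]
        tally.update(hits)
        if hits:
            papers_with_feedback += 1
    return papers_with_feedback, tally


count, tally = tally_human_feedback(PAPER_TAGS, HUMAN_FEEDBACK_TAGS)
print(count, tally.most_common())
```

Under that assumption, 5 of the 14 papers carry a feedback tag, which is consistent with the 35.7% figure, and pairwise preference is the most frequent tag.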

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage , Reshaping MOFs text mining with a dynamic multi-agents framework of large language model , Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition , How much does context affect the accuracy of AI health advice?
Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees , Cost-of-Pass: An Economic Framework for Evaluating Language Models , A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage , Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
Stratify by benchmark (Retrieval vs MedQA) before comparing methods.

Evidence: A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage , Reshaping MOFs text mining with a dynamic multi-agents framework of large language model , Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition , How much does context affect the accuracy of AI health advice?
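
One way to act on the stratification advice is to group papers by benchmark before any method comparison. The sketch below is illustrative only: the benchmark assignments are taken from the evidence and example lists on this page, and the function name `stratify_by_benchmark` is hypothetical.

```python
from collections import defaultdict

# Benchmark mentions as listed in this page's Retrieval and MedQA entries.
PAPER_BENCHMARKS = {
    "Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition": ["Retrieval"],
    "Diffusion Generative Recommendation with Continuous Tokens": ["Retrieval"],
    "Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning": ["Retrieval"],
    "A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage": ["Retrieval", "MedQA"],
}


def stratify_by_benchmark(paper_benchmarks):
    """Map each benchmark to the list of paper titles that mention it."""
    strata = defaultdict(list)
    for title, benchmarks in paper_benchmarks.items():
        for benchmark in benchmarks:
            strata[benchmark].append(title)
    return dict(strata)


strata = stratify_by_benchmark(PAPER_BENCHMARKS)
# Papers that fall into more than one stratum need care when comparing cohorts.
overlap = set(strata["Retrieval"]) & set(strata["MedQA"])
for benchmark, titles in strata.items():
    print(benchmark, len(titles))
```

A paper can land in more than one stratum, as the overlap set shows, so cohort-level comparisons should not treat the strata as disjoint.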

Benchmark Interpretation

Retrieval appears in 28.6% of hub papers (4/14); use this cohort for benchmark-matched comparisons.
MedQA appears in 7.1% of hub papers (1/14); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 35.7% of hub papers (5/14); compare with a secondary metric before ranking methods.
precision is reported in 14.3% of hub papers (2/14); compare with a secondary metric before ranking methods.
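
The percentages in the benchmark and metric interpretations follow from simple count-over-total arithmetic. As a small sanity check, assuming one-decimal rounding (the page does not state its rounding rule), the reported figures can be recomputed from the counts; `share_pct` is a hypothetical helper name.

```python
TOTAL_PAPERS = 14

# (paper count, percentage reported on this page)
REPORTED = {
    "Retrieval": (4, 28.6),
    "MedQA": (1, 7.1),
    "accuracy": (5, 35.7),
    "precision": (2, 14.3),
}


def share_pct(count, total, digits=1):
    """Percentage of `total` represented by `count`, rounded to `digits` places."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, digits)


# Collect any entry whose recomputed share differs from the reported one.
mismatches = {
    name: (share_pct(count, TOTAL_PAPERS), reported)
    for name, (count, reported) in REPORTED.items()
    if share_pct(count, TOTAL_PAPERS) != reported
}
print(mismatches or "all reported shares match")
```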

Researcher Checklist

Tighten coverage on Papers with explicit human feedback. Coverage is usable but incomplete (35.7% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (35.7% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (50% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (14.3% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (7.1% vs 35% target).
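
The checklist labels can be read as a comparison of coverage against a target. The page does not publish its thresholds, so the sketch below is an assumption: coverage at or above target is "strong", coverage at or above a hypothetical `risk_ratio` fraction of the target is "usable but incomplete", and anything lower is "replication risk". A `risk_ratio` of 0.5 happens to reproduce all six labels above, but other values would too.

```python
def coverage_status(coverage_pct, target_pct, risk_ratio=0.5):
    """Classify coverage against a target. `risk_ratio` is a hypothetical cutoff."""
    if coverage_pct >= target_pct:
        return "strong"
    if coverage_pct >= risk_ratio * target_pct:
        return "usable but incomplete"
    return "replication risk"


# (coverage %, target %) pairs as listed in the checklist.
CHECKLIST = {
    "explicit human feedback": (35.7, 45),
    "quality controls": (0, 30),
    "benchmarks/datasets": (35.7, 35),
    "evaluation metrics": (50, 35),
    "known rater population": (14.3, 35),
    "known annotation unit": (7.1, 35),
}

statuses = {name: coverage_status(cov, target) for name, (cov, target) in CHECKLIST.items()}
for name, status in statuses.items():
    print(f"{name}: {status}")
```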

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (14.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Benchmark Brief

Retrieval

Coverage: 4 papers (28.6%)

4 papers (28.6%) mention Retrieval.

Examples: Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition , Diffusion Generative Recommendation with Continuous Tokens , Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning

Benchmark Brief

MedQA

Coverage: 1 paper (7.1%)

1 paper (7.1%) mentions MedQA.

Examples: A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage

Metric Brief

accuracy

Coverage: 5 papers (35.7%)

5 papers (35.7%) mention accuracy.

Examples: Reshaping MOFs text mining with a dynamic multi-agents framework of large language model , How much does context affect the accuracy of AI health advice? , ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees

Metric Brief

precision

Coverage: 2 papers (14.3%)

2 papers (14.3%) mention precision.

Examples: Reshaping MOFs text mining with a dynamic multi-agents framework of large language model , Pretraining Language Models for Diachronic Linguistic Change Discovery

Metric Brief

coherence

Coverage: 1 paper (7.1%)

1 paper (7.1%) mentions coherence.

Examples: Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage , Reshaping MOFs text mining with a dynamic multi-agents framework of large language model , Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim · Apr 28, 2025

Sanitizing sensitive text data typically involves removing personally identifiable information (PII) or generating synthetic data under the assumption that these methods adequately protect privacy; however, their effectiveness is often only
Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
Zuhong Lin, Daoyuan Ren, Kai Ran, Jing Sun, Songlin Yu · Apr 26, 2025

Multi Agent

Accurately identifying the synthesis conditions of metal-organic frameworks (MOFs) is essential for guiding experimental design, yet remains challenging because relevant information in the literature is often scattered, inconsistent, and di
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025

Pairwise Preference · Multi Agent

These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
How much does context affect the accuracy of AI health advice?
Prashant Garg, Thiemo Fetzer · Apr 25, 2025

English-language performance does not reliably generalise across contexts, underscoring the need for multilingual, domain-specific evaluation before deployment in public-health communication.
FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu · Apr 24, 2025

We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data.
ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees
David Smith Sundarsingh, Jun Wang, Jyotirmoy V. Deshmukh, Yiannis Kantaros · Apr 22, 2025

Linear Temporal Logic (LTL) is a widely used task specification language for autonomous systems.
Cost-of-Pass: An Economic Framework for Evaluating Language Models
Mehmet Hamza Erol, Batu El, Mirac Suzgun, Mert Yuksekgonul, James Zou · Apr 17, 2025

We then define the frontier cost-of-pass: the minimum cost-of-pass achievable across available models or the human-expert(s), using the approx.
Evaluating the Diversity and Quality of LLM Generated Content
Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin · Apr 16, 2025

Pairwise Preference

Recent work suggests that preference-tuning techniques -- such as Reinforcement Learning from Human Feedback (RLHF) methods like PPO and GRPO, as well as alternatives like DPO -- reduce diversity, creating a dilemma given that these models
Diffusion Generative Recommendation with Continuous Tokens
Haohao Qu, Shanru Lin, Yujuan Ding, Yiqi Wang, Wenqi Fan · Apr 16, 2025

Pairwise Preference

Specifically, ContRec consists of two key modules: a sigma-VAE Tokenizer, which encodes users/items with continuous tokens; and a Dispersive Diffusion module, which captures implicit user preference.
Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning
Yuehan Qin, Shawn Li, Yi Nian, Xinyan Velocity Yu, Yue Zhao · Apr 8, 2025

Large language models (LLMs) have shown substantial capacity for generating fluent, contextually appropriate responses.
Pretraining Language Models for Diachronic Linguistic Change Discovery
Elisabeth Fittschen, Sabrina Li, Tom Lippincott, Leshem Choshen, Craig Messner · Apr 7, 2025

This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies.
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao · Apr 7, 2025

Red Team

We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies (in-cont
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Julian Minder, Clément Dumas, Caden Juang, Bilal Chugtai, Neel Nanda · Apr 3, 2025

Pairwise Preference

Using the BatchTopK crosscoder, we successfully identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as "false information" and "personal question", along w
m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou · Apr 1, 2025

Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32
