HFEPX Monthly Archive: 2025-02

Updated from the current HFEPX corpus (Feb 27, 2026). This monthly page groups 16 papers. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequent quality control: Calibration. Frequently cited benchmark: GSM8K. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 28, 2025.

Papers: 16 · Last published: Feb 28, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 16 papers for HFEPX Monthly Archive: 2025-02. The dominant protocol signal is automatic metrics, with benchmark focus on GSM8K and MMLU and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

31.3% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence; Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare; Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks; The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Automatic metrics appear in 100% of papers in this hub.

Evidence: Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks; The Mighty ToRR: A Benchmark for Table Reasoning and Robustness; Compressing Language Models for Specialized Domains; PII-Bench: Evaluating Query-Aware Privacy Protection Systems

GSM8K is the benchmark anchor this page suggests for cross-paper comparisons, although it is tagged in only 1 of 16 papers (6.3%).

Evidence: MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task; Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks; The Mighty ToRR: A Benchmark for Table Reasoning and Robustness; Compressing Language Models for Specialized Domains

Protocol Takeaways

The most common quality-control signal is rater calibration (6.3% of papers).

Evidence: Compressing Language Models for Specialized Domains; Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks; The Mighty ToRR: A Benchmark for Table Reasoning and Robustness; PII-Bench: Evaluating Query-Aware Privacy Protection Systems

Raters are mostly domain experts, and annotation is commonly freeform; use this to scope replication staffing.

Evidence: Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare; SEFL: A Framework for Generating Synthetic Educational Assignment Feedback with LLM Agents; MoEMba: A Mamba-based Mixture of Experts for High-Density EMG-based Hand Gesture Recognition; Oracular Programming: A Modular Foundation for Building LLM-Enabled Software

Stratify by benchmark (GSM8K vs MMLU) before comparing methods.

Evidence: Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks; The Mighty ToRR: A Benchmark for Table Reasoning and Robustness; Compressing Language Models for Specialized Domains; PII-Bench: Evaluating Query-Aware Privacy Protection Systems
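
The stratification advice above can be illustrated with a minimal Python sketch. The method names, scores, and the helper functions `stratify_by_benchmark` and `rank_within_strata` are hypothetical and are not taken from any paper on this page; the point is only that methods are ranked inside a benchmark stratum and never across strata.

```python
from collections import defaultdict

# Hypothetical records: method names and accuracy values are made up.
results = [
    {"method": "A", "benchmark": "GSM8K", "accuracy": 0.71},
    {"method": "B", "benchmark": "GSM8K", "accuracy": 0.78},
    {"method": "A", "benchmark": "MMLU", "accuracy": 0.64},
    {"method": "C", "benchmark": "MMLU", "accuracy": 0.61},
]


def stratify_by_benchmark(rows):
    """Group result rows by benchmark name."""
    strata = defaultdict(list)
    for row in rows:
        strata[row["benchmark"]].append(row)
    return dict(strata)


def rank_within_strata(rows):
    """Return {benchmark: [method, ...]} sorted by accuracy, best first."""
    return {
        bench: [
            r["method"]
            for r in sorted(group, key=lambda r: r["accuracy"], reverse=True)
        ]
        for bench, group in stratify_by_benchmark(rows).items()
    }


rankings = rank_within_strata(results)
print(rankings)
```

With these made-up numbers, method A ranks second on GSM8K and first on MMLU, which is the kind of difference a pooled ranking would hide.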

Benchmark Interpretation

GSM8K appears in 6.3% of hub papers (1/16); use this cohort for benchmark-matched comparisons.
MMLU appears in 6.3% of hub papers (1/16); use this cohort for benchmark-matched comparisons.
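
As a rough check on the coverage figures, the percentages on this page are consistent with count / 16 rounded half-up to one decimal place. The sketch below assumes that rounding rule (the page does not state it), and `coverage_pct` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP

TOTAL_PAPERS = 16  # paper count stated on this page


def coverage_pct(count, total):
    """Share of papers as a percentage, rounded half-up to one decimal.

    Assumption: half-up rounding is inferred from the displayed figures,
    not documented by the page.
    """
    pct = Decimal(count) * 100 / Decimal(total)
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for count in (1, 2, 3):
    print(f"{count}/{TOTAL_PAPERS} = {coverage_pct(count, TOTAL_PAPERS)}%")
```

Python's built-in `round` uses round-half-to-even on floats, so `round(6.25, 1)` gives 6.2 rather than the 6.3 shown here; `Decimal` with `ROUND_HALF_UP` is used to match the displayed value.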

Metric Interpretation

Accuracy is reported in 18.8% of hub papers (3/16); compare with a secondary metric before ranking methods.
Cost is reported in 12.5% of hub papers (2/16); compare with a secondary metric before ranking methods.
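
One possible way to "compare with a secondary metric" is to keep only methods that are not beaten on both accuracy and cost (a Pareto front) instead of ranking on accuracy alone. This is a sketch under that assumption, not a procedure the page prescribes; the `Result` type, `pareto_front` function, and all numbers are hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    method: str
    accuracy: float  # higher is better
    cost: float      # lower is better


def pareto_front(results):
    """Keep results that no other result dominates on accuracy and cost."""
    front = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            front.append(r)
    return front


# Hypothetical numbers, for illustration only.
candidates = [
    Result("A", accuracy=0.80, cost=10.0),
    Result("B", accuracy=0.82, cost=40.0),
    Result("C", accuracy=0.75, cost=12.0),
]
front = [r.method for r in pareto_front(candidates)]
print(front)
```

Here C drops out because A is both more accurate and cheaper, while A and B both remain because neither beats the other on both metrics.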

Researcher Checklist

Tighten coverage on papers with explicit human feedback. Coverage is usable but incomplete (31.3% vs 45% target).
Close the gap on papers reporting quality controls. Coverage is a replication risk (6.3% vs 30% target).
Tighten coverage on papers naming benchmarks/datasets. Coverage is usable but incomplete (25% vs 35% target).
Maintain strength on papers naming evaluation metrics. Coverage is strong (43.8% vs 35% target).
Tighten coverage on papers with known rater population. Coverage is usable but incomplete (25% vs 35% target).
Close the gap on papers with known annotation unit. Coverage is a replication risk (18.8% vs 35% target).
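
The checklist labels above can be reproduced with a simple threshold rule. The page does not state how the labels are assigned, so the 0.6 cutoff below is an assumption chosen only because it matches all six rows; `coverage_status` and `risk_ratio` are hypothetical names.

```python
def coverage_status(coverage, target, risk_ratio=0.6):
    """Label a coverage percentage against its target.

    Assumption: the 0.6 ratio is a guess that happens to reproduce the
    labels shown in the checklist; the real rule is not documented here.
    """
    if coverage >= target:
        return "strong"
    if coverage >= risk_ratio * target:
        return "usable but incomplete"
    return "replication risk"


# (coverage %, target %) pairs copied from the checklist.
checklist = {
    "explicit human feedback": (31.3, 45),
    "quality controls": (6.3, 30),
    "benchmarks/datasets": (25, 35),
    "evaluation metrics": (43.8, 35),
    "known rater population": (25, 35),
    "known annotation unit": (18.8, 35),
}

for name, (coverage, target) in checklist.items():
    print(f"{name}: {coverage_status(coverage, target)}")
```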

Known Limitations

Only 6.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Annotation unit is under-specified (18.8% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: GSM8K - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Benchmark Brief

GSM8K

Coverage: 1 paper (6.3%)

Examples: MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task

Benchmark Brief

MMLU

Coverage: 1 paper (6.3%)

Examples: Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

Benchmark Brief

PII-Bench

Coverage: 1 paper (6.3%)

Examples: PII-Bench: Evaluating Query-Aware Privacy Protection Systems

Metric Brief

accuracy

Coverage: 3 papers (18.8%)

Examples: Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare; Glycemic-Aware and Architecture-Agnostic Training Framework for Blood Glucose Forecasting in Type 1 Diabetes; MoEMba: A Mamba-based Mixture of Experts for High-Density EMG-based Hand Gesture Recognition

Metric Brief

cost

Coverage: 2 papers (12.5%)

Examples: Compressing Language Models for Specialized Domains; vCache: Verified Semantic Prompt Caching

Metric Brief

error rate

Coverage: 2 papers (12.5%)

Examples: Glycemic-Aware and Architecture-Agnostic Training Framework for Blood Glucose Forecasting in Type 1 Diabetes; vCache: Verified Semantic Prompt Caching

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks; The Mighty ToRR: A Benchmark for Table Reasoning and Robustness; Compressing Language Models for Specialized Domains

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Hanjiang Hu, Alexander Robey, Changliu Liu · Feb 28, 2025

Red Team

To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues.

The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz · Feb 26, 2025

To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks.

Compressing Language Models for Specialized Domains
Miles Williams, George Chrysostomou, Vitor Jeronymo, Nikolaos Aletras · Feb 25, 2025

Compression techniques such as pruning and quantization offer a practical path towards efficient LM deployment, exemplified by their ability to preserve performance on general-purpose benchmarks.

PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han · Feb 25, 2025

To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.

Can Multimodal LLMs Perform Time Series Anomaly Detection?
Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao · Feb 25, 2025

Multi Agent

One natural way for humans to detect time series anomalies is through visualization and textual description.

Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 24, 2025

Natural Language Processing (NLP) is becoming a dominant subset of artificial intelligence as the need to help machines understand human language looks indispensable.

Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen · Feb 24, 2025

Pairwise Preference

Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval.

Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman · Feb 22, 2025

Pairwise Preference, Expert Verification

Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions.

Glycemic-Aware and Architecture-Agnostic Training Framework for Blood Glucose Forecasting in Type 1 Diabetes
Saman Khamesian, Asiful Arefeen, Maria Adela Grando, Bithika M. Thompson, Hassan Ghasemzadeh · Feb 20, 2025

Managing Type 1 Diabetes (T1D) demands constant vigilance as individuals strive to regulate their blood glucose levels and avoid dysglycemia, including hyperglycemia and hypoglycemia.

SEFL: A Framework for Generating Synthetic Educational Assignment Feedback with LLM Agents
Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay · Feb 18, 2025

Critique Edit

Through comprehensive evaluations with three LLM judges and three human experts, across a subset of 900 outputs, we demonstrate that SEFL-tuned models outperform both their untuned counterparts and an existing baseline in terms of feedback

Using the Path of Least Resistance to Explain Deep Networks
Sina Salek, Joseph Enguehard · Feb 17, 2025

Through experiments on both synthetic and real-world image classification data, we provide empirical evidence supporting our theoretical analysis and showing that GIG produces more faithful attributions than existing methods, including IG,

MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Xin Xu · Feb 17, 2025

Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct, MetaMathQA and etc., we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on

Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
Bettina Messmer, Vinko Sabolčec, Martin Jaggi · Feb 14, 2025

Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks and mitigating the curse of multilinguali

MoEMba: A Mamba-based Mixture of Experts for High-Density EMG-based Hand Gesture Recognition
Mehran Shabanpour, Kasra Rad, Sadaf Khademi, Arash Mohammadi · Feb 9, 2025

High-Density surface Electromyography (HDsEMG) has emerged as a pivotal resource for Human-Computer Interaction (HCI), offering direct insights into muscle activities and motion intentions.

Oracular Programming: A Modular Foundation for Building LLM-Enabled Software
Jonathan Laurent, André Platzer · Feb 7, 2025

Demonstrations, Web Browsing

Large Language Models can solve a wide range of tasks from just a few examples, but they remain difficult to steer and lack a capability essential for building reliable software at scale: the modular composition of computations under enforc

vCache: Verified Semantic Prompt Caching
Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu · Feb 6, 2025

We release the vCache implementation and four benchmarks to support future research.
