Daily Archive

HFEPX Fortnight Archive: 2025-F25

Updated from current HFEPX corpus (Feb 27, 2026). 17 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Common annotation unit: Trajectory. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Dec 11, 2025.

Papers: 17 Last published: Dec 11, 2025 Global RSS

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 17 papers for HFEPX Fortnight Archive: 2025-F25. Dominant protocol signals include automatic metrics, simulation environments, with frequent benchmark focus on multiple benchmark families and metric focus on accuracy, cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

11.8% of papers report explicit human-feedback signals, led by demonstration data.

Evidence: AITutor-EvalKit: Exploring the Capabilities of AI Tutors , Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution , Interpreto: An Explainability Library for Transformers , KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
automatic metrics appears in 82.4% of papers in this hub.

Evidence: Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution , Interpreto: An Explainability Library for Transformers , KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification , QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
long-horizon tasks appears in 11.8% of papers, indicating agentic evaluation demand.

Evidence: Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning , Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors , Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution , Interpreto: An Explainability Library for Transformers

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution , Interpreto: An Explainability Library for Transformers , KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification , QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Rater context is mostly unspecified rater pools, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Evidence: Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution , Interpreto: An Explainability Library for Transformers , KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification , QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Track metric sensitivity by reporting both accuracy and cost.

Evidence: Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution , Interpreto: An Explainability Library for Transformers , KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification , QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models

Metric Interpretation

accuracy is reported in 23.5% of hub papers (4/17); compare with a secondary metric before ranking methods.
cost is reported in 23.5% of hub papers (4/17); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (11.8% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (0% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (52.9% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (0% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (5.9% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (11.8% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).

Papers naming benchmarks/datasets

Coverage is a replication risk (0% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (52.9% vs 35% target).

Papers with known rater population

Coverage is a replication risk (0% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (5.9% vs 35% target).

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=0, left_only=14, right_only=3

0 papers use both Automatic Metrics and Simulation Env.

Metric Brief

accuracy

Coverage: 4 papers (23.5%)

4 papers (23.5%) mention accuracy.

Examples: KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification , STaRR: Spatial-Temporal Token-Dynamics-Aware Responsive Remasking for Diffusion Language Models , Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors

Metric Brief

cost

Coverage: 4 papers (23.5%)

4 papers (23.5%) mention cost.

Examples: QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models , Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning , Group Representational Position Encoding

Metric Brief

perplexity

Coverage: 2 papers (11.8%)

2 papers (11.8%) mention perplexity.

Examples: Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers , Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution , Interpreto: An Explainability Library for Transformers , KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution
Jonathan Kamp, Roos Bakker, Dominique Blok · Dec 11, 2025

Pairwise Preference

In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics.
Interpreto: An Explainability Library for Transformers
Antonin Poché, Thomas Mullor, Gabriele Sarti, Frédéric Boisnard, Corentin Friedrich · Dec 10, 2025

Interpreto is an open-source Python library for interpreting HuggingFace language models, from early BERT variants to LLMs.
KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh · Dec 9, 2025

Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and managemen
QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Maximilian Kreutner, Jens Rupprecht, Georg Ahnert, Ahmed Salem, Markus Strohmaier · Dec 9, 2025

QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods.
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu · Dec 9, 2025

Long Horizon

Extensive experiments on the AerialVLN and OpenFly benchmark validate the effectiveness of our method.
Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan · Dec 8, 2025

We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions.
STaRR: Spatial-Temporal Token-Dynamics-Aware Responsive Remasking for Diffusion Language Models
Xinhao Sun, Huaijin Zhao, Maoliang Li, Zihao Zheng, Jiayu Chen · Dec 7, 2025

Diffusion Language Models (DLMs) enable parallel decoding via iterative denoising, where remasking strategies play a critical role in balancing inference speed and output quality.
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025

Long Horizon

We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li · Dec 3, 2025

Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths.
AITutor-EvalKit: Exploring the Capabilities of AI Tutors
Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025

Demonstrations

We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization.
Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs
Kunj Joshi, David A. Smith · Dec 2, 2025

We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and show the performance of RMFT vs Deduplication by Area Under The Response Curve (AURC) metric.
Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li · Dec 2, 2025

Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision.
From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?
Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall · Dec 2, 2025

To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison.
promptolution: A Unified, Modular Framework for Prompt Optimization
Tom Zehle, Timo Heiß, Moritz Schlager, Matthias Aßenmacher, Matthias Feurer · Dec 2, 2025

It integrates multiple contemporary discrete prompt optimizers, supports systematic and reproducible benchmarking, and returns framework-agnostic prompt strings, enabling seamless integration into existing LLM pipelines while remaining agno
BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion
Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti · Dec 2, 2025

The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
Robert Belanec, Ivan Srba, Maria Bielikova · Dec 2, 2025

While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics.
Cross-Lingual Interleaving for Speech Language Models
Adel Moumen, Guangzhi Sun, Philip C. Woodland · Dec 1, 2025

However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult.

Recent Daily Archives

fortnight-2026-f04 (335) week-2026-w08 (288) quarter-2026-q1 (732) month-2026-02 (682) fortnight-2026-f05 (323) week-2026-w09 (323) 2026-02-23 (56) 2026-02-25 (89) 2026-02-24 (118) month-2026-01 (50) 2026-02-20 (37) week-2026-w07 (47) 2026-02-26 (60) quarter-2025-q4 (124) 2026-02-18 (56) 2026-02-21 (22) 2026-02-17 (54) 2026-02-19 (60) fortnight-2026-f02 (27) 2026-02-22 (23) 2026-02-16 (36) month-2025-10 (69) quarter-2025-q2 (78) fortnight-2026-f01 (16) fortnight-2026-f03 (34) quarter-2025-q3 (90) week-2026-w06 (23) month-2025-12 (30) fortnight-2025-f22 (32) quarter-2025-q1 (35) month-2025-06 (39) month-2025-09 (44) month-2025-11 (25) week-2026-w03 (15) week-2026-w04 (12) 2026-02-13 (8) fortnight-2025-f21 (32) 2026-02-15 (7) fortnight-2025-f20 (34) fortnight-2025-f12 (29)

HFEPX Fortnight Archive: 2025-F25

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Papers Published On This Date

Recent Daily Archives