Daily Archive

HFEPX Monthly Archive: 2025-12

Updated from current HFEPX corpus (Feb 27, 2026). 30 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: BIRD. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Dec 31, 2025.

Papers: 30 Last published: Dec 31, 2025 Global RSS

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 30 papers for HFEPX Monthly Archive: 2025-12. Dominant protocol signals include automatic metrics, simulation environments, with frequent benchmark focus on BIRD, BrowseComp and metric focus on accuracy, cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

13.3% of papers report explicit human-feedback signals, led by demonstration data.

Evidence: RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment , WISE: Web Information Satire and Fakeness Evaluation , Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss , CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
automatic metrics appears in 90% of papers in this hub.

Evidence: RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment , WISE: Web Information Satire and Fakeness Evaluation , Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss , CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
BIRD is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment , WISE: Web Information Satire and Fakeness Evaluation , Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss , CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Protocol Takeaways

Most common quality-control signal is rater calibration (3.3% of papers).

Evidence: WISE: Web Information Satire and Fakeness Evaluation , CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics , RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment , Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Evidence: Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss , CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics , RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment , WISE: Web Information Satire and Fakeness Evaluation
Stratify by benchmark (BIRD vs BrowseComp) before comparing methods.

Evidence: RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment , WISE: Web Information Satire and Fakeness Evaluation , Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss , CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Benchmark Interpretation

BIRD appears in 3.3% of hub papers (1/30); use this cohort for benchmark-matched comparisons.
BrowseComp appears in 3.3% of hub papers (1/30); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 33.3% of hub papers (10/30); compare with a secondary metric before ranking methods.
cost is reported in 23.3% of hub papers (7/30); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (13.3% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (6.7% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (10% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (63.3% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (6.7% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (3.3% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (13.3% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (6.7% vs 30% target).

Papers naming benchmarks/datasets

Coverage is a replication risk (10% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (63.3% vs 35% target).

Papers with known rater population

Coverage is a replication risk (6.7% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (3.3% vs 35% target).

Known Limitations

Only 6.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6.7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: BIRD - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=1, left_only=26, right_only=3

1 papers use both Automatic Metrics and Simulation Env.

Benchmark Brief

BIRD

Coverage: 1 papers (3.3%)

1 papers (3.3%) mention BIRD.

Examples: CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Benchmark Brief

BrowseComp

Coverage: 1 papers (3.3%)

1 papers (3.3%) mention BrowseComp.

Examples: Towards Efficient Agents: A Co-Design of Inference Architecture and System

Benchmark Brief

Cricbench

Coverage: 1 papers (3.3%)

1 papers (3.3%) mention Cricbench.

Examples: CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Metric Brief

accuracy

Coverage: 10 papers (33.3%)

10 papers (33.3%) mention accuracy.

Examples: WISE: Web Information Satire and Fakeness Evaluation , CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics , DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation

Metric Brief

cost

Coverage: 7 papers (23.3%)

7 papers (23.3%) mention cost.

Examples: Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss , DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation , Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL

Metric Brief

perplexity

Coverage: 2 papers (6.7%)

2 papers (6.7%) mention perplexity.

Examples: Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers , Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment , WISE: Web Information Satire and Fakeness Evaluation , Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang · Dec 31, 2025

While large language models (LLMs) have shown significant results on relevance task, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics acr
WISE: Web Information Satire and Fakeness Evaluation
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury · Dec 30, 2025

This study develops WISE (Web Information Satire and Fakeness Evaluation) framework which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as eith
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao · Dec 29, 2025

Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance.
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025

Expert Verification

To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation
Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen · Dec 24, 2025

Reasoning distillation has attracted increasing attention.
DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull · Dec 23, 2025

Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.
On the Existence and Behavior of Secondary Attention Sinks
Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu · Dec 22, 2025

Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance.
Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?
Amar Lakel · Dec 22, 2025

This paper proposes an epistemological shift in the analysis of large generative models, replacing the category ''Large Language Models'' (LLM) with that of ''Large Discourse Models'' (LDM), and then with that of Artificial Discursive Agent
Towards Efficient Agents: A Co-Design of Inference Architecture and System
Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu · Dec 20, 2025

Long Horizon

The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Khushboo Thaker, Yony Bresler · Dec 18, 2025

Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance.
In-Context Algebra
Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau · Dec 18, 2025

We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions in-context.
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus · Dec 18, 2025

Red Team

We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal--compliance direction.
A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
Mengfan Shen, Kangqi Song, Xindi Wang, Wei Jia, Tao Wang · Dec 18, 2025

Structured information extraction from police incident announcements is crucial for timely and accurate data processing, yet presents considerable challenges due to the variability and informal nature of textual sources such as social media
Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution
Jonathan Kamp, Roos Bakker, Dominique Blok · Dec 11, 2025

Pairwise Preference

In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics.
Interpreto: An Explainability Library for Transformers
Antonin Poché, Thomas Mullor, Gabriele Sarti, Frédéric Boisnard, Corentin Friedrich · Dec 10, 2025

Interpreto is an open-source Python library for interpreting HuggingFace language models, from early BERT variants to LLMs.
KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh · Dec 9, 2025

Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and managemen
QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Maximilian Kreutner, Jens Rupprecht, Georg Ahnert, Ahmed Salem, Markus Strohmaier · Dec 9, 2025

QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods.
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu · Dec 9, 2025

Long Horizon

Extensive experiments on the AerialVLN and OpenFly benchmark validate the effectiveness of our method.
Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan · Dec 8, 2025

We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions.
STaRR: Spatial-Temporal Token-Dynamics-Aware Responsive Remasking for Diffusion Language Models
Xinhao Sun, Huaijin Zhao, Maoliang Li, Zihao Zheng, Jiayu Chen · Dec 7, 2025

Diffusion Language Models (DLMs) enable parallel decoding via iterative denoising, where remasking strategies play a critical role in balancing inference speed and output quality.
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025

Long Horizon

We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li · Dec 3, 2025

Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths.
AITutor-EvalKit: Exploring the Capabilities of AI Tutors
Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025

Demonstrations

We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization.
Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs
Kunj Joshi, David A. Smith · Dec 2, 2025

We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and show the performance of RMFT vs Deduplication by Area Under The Response Curve (AURC) metric.
Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li · Dec 2, 2025

Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision.
From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?
Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall · Dec 2, 2025

To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison.
promptolution: A Unified, Modular Framework for Prompt Optimization
Tom Zehle, Timo Heiß, Moritz Schlager, Matthias Aßenmacher, Matthias Feurer · Dec 2, 2025

It integrates multiple contemporary discrete prompt optimizers, supports systematic and reproducible benchmarking, and returns framework-agnostic prompt strings, enabling seamless integration into existing LLM pipelines while remaining agno
BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion
Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti · Dec 2, 2025

The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
Robert Belanec, Ivan Srba, Maria Bielikova · Dec 2, 2025

While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics.
Cross-Lingual Interleaving for Speech Language Models
Adel Moumen, Guangzhi Sun, Philip C. Woodland · Dec 1, 2025

However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult.

Recent Daily Archives

fortnight-2026-f04 (335) week-2026-w08 (288) quarter-2026-q1 (732) month-2026-02 (682) fortnight-2026-f05 (323) week-2026-w09 (323) 2026-02-23 (56) 2026-02-25 (89) 2026-02-24 (118) month-2026-01 (50) 2026-02-20 (37) week-2026-w07 (47) 2026-02-26 (60) quarter-2025-q4 (124) 2026-02-18 (56) 2026-02-21 (22) 2026-02-17 (54) 2026-02-19 (60) fortnight-2026-f02 (27) 2026-02-22 (23) 2026-02-16 (36) month-2025-10 (69) quarter-2025-q2 (78) fortnight-2026-f01 (16) fortnight-2026-f03 (34) quarter-2025-q3 (90) week-2026-w06 (23) fortnight-2025-f22 (32) quarter-2025-q1 (35) month-2025-06 (39) month-2025-09 (44) month-2025-11 (25) week-2026-w03 (15) week-2026-w04 (12) 2026-02-13 (8) fortnight-2025-f21 (32) 2026-02-15 (7) fortnight-2025-f20 (34) fortnight-2025-f12 (29) week-2025-w39 (21)

HFEPX Monthly Archive: 2025-12

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Papers Published On This Date

Recent Daily Archives