HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W49

Updated from current HFEPX corpus (Mar 8, 2026). 19 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 8, 2026). 19 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Scalar. Frequently cited benchmark: GSM8K. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Dec 7, 2025.

Papers: 19 Last published: Dec 7, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

19 / 19 papers are not low-signal flagged.

Benchmark Anchors

5.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

42.1%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Time Slice Matters

15.8% of papers report explicit human-feedback signals, led by demonstration data.
automatic metrics appears in 36.8% of papers in this hub.
GSM8K is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly scalar scoring; use this to scope replication staffing.
Stratify by benchmark (GSM8K vs Longmemeval) before comparing methods.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Dec 3, 2025 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Cost
Diffusion Model in Latent Space for Medical Image Segmentation Task
Dec 1, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Mse
STaRR: Spatial-Temporal Token-Dynamics-Aware Responsive Remasking for Diffusion Language Models
Dec 7, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Accuracy
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Dec 6, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Accuracy
ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering
Dec 5, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Accuracy, Recall
Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs
Dec 2, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Perplexity

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs Dec 3, 2025	Automatic Metrics	MATH 500, GSM8K	Cost	Not reported
Diffusion Model in Latent Space for Medical Image Segmentation Task Dec 1, 2025	Automatic Metrics	Not reported	Mse	Not reported
STaRR: Spatial-Temporal Token-Dynamics-Aware Responsive Remasking for Diffusion Language Models Dec 7, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors Dec 6, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering Dec 5, 2025	Automatic Metrics	Not reported	Accuracy, Recall	Not reported
Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs Dec 2, 2025	Automatic Metrics	Not reported	Perplexity	Not reported
Cross-Lingual Interleaving for Speech Language Models Dec 1, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers Dec 3, 2025	Not reported	Not reported	Perplexity, Cost	Not reported
Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation Dec 7, 2025	Not reported	Not reported	Not reported	Not reported
AITutor-EvalKit: Exploring the Capabilities of AI Tutors Dec 3, 2025	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (15.8% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (5.3% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (26.3% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (10.5% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (5.3% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.5% coverage).
Annotation unit is under-specified (5.3% coverage).

Suggested Next Analyses

Stratify by benchmark (GSM8K vs Longmemeval) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Benchmark Slice: GSM8K Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (7)
Simulation Env (1)

Top Metrics

Accuracy (3)
Cost (2)
Dice (1)
Iou (1)

Top Benchmarks

GSM8K (1)
Longmemeval (1)
MATH 500 (1)

Quality Controls

Papers In This Archive Slice

STaRR: Spatial-Temporal Token-Dynamics-Aware Responsive Remasking for Diffusion Language Models
Xinhao Sun, Huaijin Zhao, Maoliang Li, Zihao Zheng, Jiayu Chen · Dec 7, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation
Chengbing Wang, Yang Zhang, Wenjie Wang, Xiaoyan Zhao, Fuli Feng · Dec 7, 2025 · Citations: 0

Pairwise Preference

Preference alignment has enabled large language models (LLMs) to better reflect human expectations, but current methods mostly optimize for population-level preferences, overlooking individual users.
Towards Small Language Models for Security Query Generation in SOC Workflows
Saleha Muzammil, Rahul Reddy, Vishal Kamalakrishnan, Hadi Ahmadi, Wajih Ul Hassan · Dec 7, 2025 · Citations: 0
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025 · Citations: 0

Long Horizon

We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
Automated Data Enrichment using Confidence-Aware Fine-Grained Debate among Open-Source LLMs for Mental Health and Online Safety
Junyu Mao, Anthony Hills, Talia Tseriotou, Maria Liakata, Aya Shamir · Dec 6, 2025 · Citations: 0

Real-world indicators play an important role in many natural language processing (NLP) applications, such as life-event for mental health analysis and risky behaviour for online safety, yet labelling such information in training datasets is…
ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering
Daeyong Kwon, SeungHeon Doh, Juhan Nam · Dec 5, 2025 · Citations: 0

We introduce MusWikiDB, a vector database of 3.2M passages from 144K music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists with metadata such as genre, debut year, and topic.
Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li · Dec 3, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye · Dec 3, 2025 · Citations: 0
AITutor-EvalKit: Exploring the Capabilities of AI Tutors
Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025 · Citations: 0

Demonstrations

We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization.
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying · Dec 3, 2025 · Citations: 0

Long Horizon

Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBenchV2 and SCBench), TRIM-KV consistently outperforms…
Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs
Kunj Joshi, David A. Smith · Dec 2, 2025 · Citations: 0

We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and show the performance of RMFT vs Deduplication by Area Under The Response Curve (AURC) metric.
Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li · Dec 2, 2025 · Citations: 0

To answer this question, we propose SU S VI B E S, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations.
From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?
Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall · Dec 2, 2025 · Citations: 0

To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison.
promptolution: A Unified, Modular Framework for Prompt Optimization
Tom Zehle, Timo Heiß, Moritz Schlager, Matthias Aßenmacher, Matthias Feurer · Dec 2, 2025 · Citations: 0

It integrates multiple contemporary discrete prompt optimizers, supports systematic and reproducible benchmarking, and returns framework-agnostic prompt strings, enabling seamless integration into existing LLM pipelines while remaining…
BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion
Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti · Dec 2, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
Robert Belanec, Ivan Srba, Maria Bielikova · Dec 2, 2025 · Citations: 0

While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics.
Cross-Lingual Interleaving for Speech Language Models
Adel Moumen, Guangzhi Sun, Philip C. Woodland · Dec 1, 2025 · Citations: 0

However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult.
InnoGym: Benchmarking the Innovation Potential of AI Agents
Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu · Dec 1, 2025 · Citations: 0
Diffusion Model in Latent Space for Medical Image Segmentation Task
Huynh Trinh Ngoc, Toan Nguyen Hai, Ba Luong Son, Long Tran Quoc · Dec 1, 2025 · Citations: 0

Expert Verification

Medical image segmentation is crucial for clinical diagnosis and treatment planning.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote