HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W51

Updated from current HFEPX corpus (Apr 17, 2026). 44 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 17, 2026). 44 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequent quality control: Calibration. Frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Dec 21, 2025.

Papers: 44 Last published: Dec 21, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

44 / 44 papers are not low-signal flagged.

Benchmark Anchors

9.1%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

27.3%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

9.1% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 25% of papers in this hub.
BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (2.3% of papers).
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Towards Efficient Agents: A Co-Design of Inference Architecture and System
Dec 20, 2025 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy, Latency
Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Dec 18, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Exact match
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Dec 18, 2025 · Citations: 0 · Score: 5.0

Eval: Llm As Judge · Metrics: Not reported
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Dec 18, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Cost
Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Dec 18, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Cost
In-Context Algebra
Dec 18, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Towards Efficient Agents: A Co-Design of Inference Architecture and System Dec 20, 2025	Automatic Metrics	BrowseComp	Accuracy, Latency	Not reported
Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning Dec 18, 2025	Automatic Metrics	Not reported	Exact match	Not reported
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics Dec 18, 2025	Llm As Judge	Jailbreakbench	Not reported	Not reported
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills Dec 18, 2025	Automatic Metrics	Not reported	Cost	Not reported
Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL Dec 18, 2025	Automatic Metrics	Not reported	Cost	Not reported
In-Context Algebra Dec 18, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models Dec 18, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media Dec 18, 2025	Automatic Metrics	Not reported	Accuracy, Exact match	Not reported
Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent Dec 17, 2025	Automatic Metrics	Not reported	Success rate	Not reported
Dual-objective Language Models: Training Efficiency Without Overfitting Dec 16, 2025	Automatic Metrics	Not reported	Cost	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (9.1% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (2.3% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (11.4% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (25% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (4.5% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (6.8% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 2.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (4.5% coverage).
Annotation unit is under-specified (6.8% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (BrowseComp vs CSyMR-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: BrowseComp Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 2.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (4.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (11)
Llm As Judge (1)

Top Metrics

Accuracy (7)
Cost (4)
Exact match (1)
Latency (1)

Top Benchmarks

BrowseComp (1)
CSyMR Bench (1)
DROP (1)
IFEval (1)

Quality Controls

Calibration (1)

Papers In This Archive Slice

From Word to World: Can Large Language Models be Implicit Text-based World Models?
Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang · Dec 21, 2025 · Citations: 0

Long Horizon

Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale.
Adaptive Accountability in Networked MAS: Tracing and Mitigating Emergent Norms at Scale
Saad Alqithami · Dec 21, 2025 · Citations: 0
NASTaR: NovaSAR Automated Ship Target Recognition Dataset
Benyamin Hosseiny, Kamirul Kamirul, Odysseas Pappas, Alin Achim · Dec 20, 2025 · Citations: 0
Exploration vs. Fixation: Scaffolding Divergent and Convergent Thinking for Human-AI Co-Creation with Generative Models
Chao Wen, Tung Phung, Pronita Mehrotra, Sumit Gulwani, Roger E. Beaty · Dec 20, 2025 · Citations: 0

We examine an approach grounded in the Geneplore model of creative cognition and instantiate it in a human-AI co-creation system, HAICo, for creative image generation.
Towards Efficient Agents: A Co-Design of Inference Architecture and System
Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu · Dec 20, 2025 · Citations: 0

Long Horizon

The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation
Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park · Dec 19, 2025 · Citations: 0

Rubric RatingExpert Verification Long Horizon

However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and…
Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science
Jan Philip Wahle, Krishnapriya Vishnubhotla, Bela Gipp, Saif M. Mohammad · Dec 19, 2025 · Citations: 0

ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.
RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette · Dec 19, 2025 · Citations: 0

The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
Value Under Ignorance in Universal Artificial Intelligence
Cole Wyeth, Marcus Hutter · Dec 18, 2025 · Citations: 0

We generalize the AIXI reinforcement learning agent to admit a wider class of utility functions.
Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Khushboo Thaker, Yony Bresler · Dec 18, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SFBD-OMNI: Bridge models for lossy measurement restoration with limited clean samples
Haoye Lu, Yaoliang Yu, Darren Lo · Dec 18, 2025 · Citations: 0
Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille · Dec 18, 2025 · Citations: 0

Pairwise Preference

Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training.
In-Context Algebra
Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau · Dec 18, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus · Dec 18, 2025 · Citations: 0

Red Team

We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal--compliance direction.
TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li · Dec 18, 2025 · Citations: 0

Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios.
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano · Dec 18, 2025 · Citations: 0

We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual…
Pretrained battery transformer (PBT): A foundation model for universal battery life prediction
Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang · Dec 18, 2025 · Citations: 0
Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu · Dec 18, 2025 · Citations: 0

Evaluation of six state-of-the-art LLMs reveals pervasive risk: the average Overall Leakage Rate reaches 62.11% with an H-Score of only 52.90%.
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He · Dec 18, 2025 · Citations: 0

Pairwise Preference Tool Use

Large language model (LLM) agents are moving beyond prompting alone.
A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
Mengfan Shen, Kangqi Song, Xindi Wang, Wei Jia, Tao Wang · Dec 18, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Social Story Frames: Contextual Reasoning about Narrative Intent and Reception
Joel Mire, Maria Antoniak, Steven R. Wilson, Zexin Ma, Achyutarama R. Ganti · Dec 17, 2025 · Citations: 0
Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning
Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan Lu · Dec 17, 2025 · Citations: 0
Learning continuous state of charge dependent thermal decomposition kinetics for Li-ion cathodes using Kolmogorov-Arnold Chemical Reaction Neural Networks (KA-CRNNs)
Benjamin C. Koenig, Sili Deng · Dec 17, 2025 · Citations: 0
Physics-driven human-like working memory outperforms digital networks in dynamic vision
Jingli Liu, Huannan Zheng, Bohao Zou, Kezhou Yang · Dec 17, 2025 · Citations: 0
Enhancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point Cloud Projections
Adrian Straker, Paul Magdon, Marco Zullich, Maximilian Freudenberg, Christoph Kleinn · Dec 17, 2025 · Citations: 0
The Moralization Corpus: Frame-Based Annotation and Analysis of Moralizing Speech Acts across Diverse Text Genres
Maria Becker, Mirko Sommer, Lars Tapken, Yi Wan Teh, Bruno Brocai · Dec 17, 2025 · Citations: 0

Moralizations are pragmatically complex and often implicit, posing significant challenges for both human annotators and NLP systems.
MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang · Dec 17, 2025 · Citations: 0
Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent
Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh · Dec 17, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li · Dec 16, 2025 · Citations: 0

We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria.
A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images
Rao Muhammad Umer, Daniel Sens, Jonathan Noll, Sohom Dey, Christian Matek · Dec 16, 2025 · Citations: 0

Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides directly, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking.
Dual-objective Language Models: Training Efficiency Without Overfitting
David Samuel, Lucas Georges Gabriel Charpentier · Dec 16, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CSyMR: Benchmarking Compositional Music Information Retrieval in Symbolic Music Reasoning
Boyang Wang, Yash Vishe, Xin Xu, Zachary Novack, Xunyi Jiang · Dec 16, 2025 · Citations: 0
GRAFT: Grid-Aware Load Forecasting with Multi-Source Textual Alignment and Fusion
Fangzhou Lin, Guoshun He, Zhenyu Guo, Zhe Huang, Jinsong Tao · Dec 16, 2025 · Citations: 0
RePo: Language Models with Context Re-Positioning
Huayang Li, Tianyu Zhao, Deng Cai, Richard Sproat · Dec 16, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Systematic Analysis of Biases in Large Language Models
Xulang Zhang, Rui Mao, Erik Cambria · Dec 16, 2025 · Citations: 0
Olmo 3
Team Olmo, :, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl · Dec 15, 2025 · Citations: 0
Towards Interactive Intelligence for Digital Humans
Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang · Dec 15, 2025 · Citations: 0

We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution.
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li · Dec 15, 2025 · Citations: 0

Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with a 34\% performance gain and an over 18\times speedup on average, but also bridges the performance gap to strong ARMs…
NRR-Core: Non-Resolution Reasoning as a Computational Framework for Contextual Identity and Ambiguity Preservation
Kei Saito · Dec 15, 2025 · Citations: 0

In the narrow non-evaluative read adopted later in the series, the practical point is not that no judgment ever occurs, but that retained alternatives need not be implemented as repeated full branchwise comparative evaluation during…
On the Effectiveness of Membership Inference in Targeted Data Extraction from Large Language Models
Ali Al Sahili, Ali Chehab, Razane Tajeddine · Dec 15, 2025 · Citations: 0
What Makes an Ideal Quote? Recommending "Unexpected yet Rational" Quotations via Novelty
Bowei Zhang, Jin Xiao, Guanglei Yue, Qianyu He, Yanghua Xiao · Dec 15, 2025 · Citations: 0

A generative label agent first interprets each quotation and its surrounding context into multi-dimensional deep-meaning labels, enabling label-enhanced retrieval.
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi · Dec 15, 2025 · Citations: 0
Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection
Xuwei Tan, Yao Ma, Xueru Zhang · Dec 15, 2025 · Citations: 0

Detecting fraud in financial transactions typically relies on tabular models that demand heavy feature engineering to handle high-dimensional data and offer limited interpretability, making it difficult for humans to understand predictions.
Revisiting the Reliability of Language Models in Instruction-Following
Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei · Dec 15, 2025 · Citations: 0

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now