HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-11

Updated from current HFEPX corpus (Apr 12, 2026). 19 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 19 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: agent-diff-bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 11, 2026.

Papers: 19 Last published: Feb 11, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

19 / 19 papers are not low-signal flagged.

Benchmark Anchors

21.1%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

26.3%

Papers with reported metric mentions in extraction output.

2 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

10.5% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 10.5% of papers in this hub.
agent-diff-bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (5.3% of papers).
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Stratify by benchmark (agent-diff-bench vs BrowseComp) before comparing methods.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation
Feb 11, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Task success
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Feb 11, 2026 · Citations: 0 · Score: 6.0

Eval: Not reported · Metrics: Latency, Cost
Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection
Feb 11, 2026 · Citations: 0 · Score: 5.5

Eval: Simulation Env · Metrics: Kappa, Agreement
The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task
Feb 11, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Accuracy
Voxtral Realtime
Feb 11, 2026 · Citations: 0 · Score: 3.5

Eval: Not reported · Metrics: Latency
DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning
Feb 11, 2026 · Citations: 0 · Score: 3.5

Eval: Not reported · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation Feb 11, 2026	Automatic Metrics	Agent Diff Bench	Task success	Not reported
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters Feb 11, 2026	Not reported	LiveCodeBench, BrowseComp	Latency, Cost	Not reported
Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection Feb 11, 2026	Simulation Env	Not reported	Kappa, Agreement	Inter Annotator Agreement Reported
The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task Feb 11, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Voxtral Realtime Feb 11, 2026	Not reported	Not reported	Latency	Not reported
DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning Feb 11, 2026	Not reported	AIME	Not reported	Not reported
Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models Feb 11, 2026	Not reported	GSM8K, HumanEval+	Not reported	Not reported
Learning Page Order in Shuffled WOO Releases Feb 11, 2026	Not reported	Not reported	Not reported	Not reported
TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation Feb 11, 2026	Not reported	Not reported	Not reported	Calibration
When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing Feb 11, 2026	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (10.5% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (10.5% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (21.1% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (36.8% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (15.8% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (5.3% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 10.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15.8% coverage).
Annotation unit is under-specified (5.3% coverage).

Suggested Next Analyses

Stratify by benchmark (agent-diff-bench vs BrowseComp) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Benchmark Slice: agent-diff-bench Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 10.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15.8% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (2)
Simulation Env (1)

Top Metrics

Accuracy (4)
Cost (1)
Latency (1)
Relevance (1)

Top Benchmarks

Agent Diff Bench (1)
BrowseComp (1)
GraphBench (1)
Imo Answerbench (1)

Quality Controls

Calibration (1)
Inter Annotator Agreement Reported (1)

Papers In This Archive Slice

Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection
Md Tanvir Rouf Shawon, Mohammad Sabik Irbaz, Hadeel R. A. Elyazori, Keerti Reddy Resapu, Yili Lin · Feb 11, 2026 · Citations: 0

Objective: This paper introduces a patient simulator for scalable, automated evaluation of healthcare conversational agents, generating realistic, controllable interactions that systematically vary across medical, linguistic, and behavioral…
When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
Zachary Pedram Dadfar · Feb 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Voxtral Realtime
Mistral-AI, :, Alexander H. Liu, Andy Ehrenberg, Andy Lo · Feb 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning
Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, Kai Chen · Feb 11, 2026 · Citations: 0

Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and…
GraphSeek: Next-Generation Graph Analytics with LLMs
Maciej Besta, Łukasz Jarmocik, Orest Hrycyna, Shachar Klaiman, Konrad Mączka · Feb 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Embedding Inversion via Conditional Masked Diffusion Language Models
Han Xiao · Feb 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Learning Page Order in Shuffled WOO Releases
Efe Kahraman, Giulio Tosato · Feb 11, 2026 · Citations: 0

Pairwise Preference

We observe two unexpected failures: seq2seq transformers fail to generalize on long documents (Kendall's tau drops from 0.918 on 2-5 pages to 0.014 on 21-25 pages), and curriculum learning underperforms direct training by 39% on long…
When Fusion Helps and When It Breaks: View-Aligned Robustness in Same-Source Financial Imaging
Rui Ma · Feb 11, 2026 · Citations: 0

To control label ambiguity from near-zero moves, we use an ex-post minimum-movement threshold min_move (tau) based on realized absolute next-day return, defining an offline benchmark on the subset where the absolute next-day return is at…
LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules
Ivan Vulić, Adam Grycner, Quentin de Laroussilhe, Jonas Pfeiffer · Feb 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models
Mingyu Cao, Alvaro H. C. Correia, Christos Louizos, Shiwei Liu, Lu Yin · Feb 11, 2026 · Citations: 0

Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and…
Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation
Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson · Feb 11, 2026 · Citations: 0

Tool Use

We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world productivity software API tasks via code execution.
Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion
Pengcheng Zhou, Haochen Li, Zhiqiang Nie, JiaLe Chen, Qing Gong · Feb 11, 2026 · Citations: 0
The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task
Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos · Feb 11, 2026 · Citations: 0

Web Browsing

The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455.
To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks
Nanxu Gong, Haotian Li, Sixun Dong, Jianxun Lian, Yanjie Fu · Feb 11, 2026 · Citations: 0
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026 · Citations: 0

Pairwise Preference Tool Use

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
Learning Adaptive Distribution Alignment with Neural Characteristic Function for Graph Domain Adaptation
Wei Chen, Xingyu Guo, Shuang Li, Zhao Zhang, Yan Zhong · Feb 11, 2026 · Citations: 0
Neuro-Symbolic Synergy for Interactive World Modeling
Hongyu Zhao, Siyu Zhou, Haolin Yang, Zengyi Qin, Tianyi Zhou · Feb 11, 2026 · Citations: 0
TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation
Steven Liu, Jane Luo, Xin Zhang, Aofan Liu, Hao Liu · Feb 11, 2026 · Citations: 0

To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments.
When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents
Virginie Mouilleron, Théo Lasnier, Anna Mosolova, Djamé Seddah · Feb 11, 2026 · Citations: 0

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote