HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-02

Updated from current HFEPX corpus (Apr 12, 2026). 26 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 26 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Adjudication. Frequently cited benchmark: HellaSwag. Common metric signal: relevance. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 2, 2026.

Papers: 26 Last published: Feb 2, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

26 / 26 papers are not low-signal flagged.

Benchmark Anchors

7.7%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

34.6%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

7.7% of papers report explicit human-feedback signals, led by demonstration data.
automatic metrics appears in 34.6% of papers in this hub.
HellaSwag is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is adjudication (3.8% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Stratify by benchmark (HellaSwag vs Vdr-Bench) before comparing methods.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Feb 2, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Not reported
COMI: Coarse-to-fine Context Compression via Marginal Information Gain
Feb 2, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Exact match, Relevance
WAXAL: A Large-Scale Multilingual African Language Speech Corpus
Feb 2, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Jailbreak success rate
From Sycophancy to Sensemaking: Premise Governance for Human-AI Decision Making
Feb 2, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Cost
Proof-RM: A Scalable and Generalizable Reward Model for Math Proof
Feb 2, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Accuracy
CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models
Feb 2, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Throughput

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models Feb 2, 2026	Automatic Metrics	Vdr Bench	Not reported	Adjudication
COMI: Coarse-to-fine Context Compression via Marginal Information Gain Feb 2, 2026	Automatic Metrics	NQ, HotpotQA	Exact match, Relevance	Not reported
WAXAL: A Large-Scale Multilingual African Language Speech Corpus Feb 2, 2026	Automatic Metrics	Not reported	Jailbreak success rate	Not reported
From Sycophancy to Sensemaking: Premise Governance for Human-AI Decision Making Feb 2, 2026	Automatic Metrics	Not reported	Cost	Not reported
Proof-RM: A Scalable and Generalizable Reward Model for Math Proof Feb 2, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models Feb 2, 2026	Automatic Metrics	Not reported	Throughput	Not reported
AXE: Low-Cost Cross-Domain Web Structured Information Extraction Feb 2, 2026	Automatic Metrics	Not reported	F1, Cost	Not reported
Mechanistic Indicators of Steering Effectiveness in Large Language Models Feb 2, 2026	Automatic Metrics	Not reported	Agreement	Not reported
Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models Feb 2, 2026	Automatic Metrics	Not reported	Accuracy, Pass@1	Not reported
Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts Feb 2, 2026	Not reported	Not reported	Latency	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (7.7% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (3.8% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (7.7% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (19.2% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (7.7% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (7.7% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 3.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.7% coverage).
Annotation unit is under-specified (7.7% coverage).

Suggested Next Analyses

Stratify by benchmark (HellaSwag vs Vdr-Bench) before comparing methods.
Track metric sensitivity by reporting both relevance and accuracy.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

Benchmark Slice: HellaSwag Metric Slice: relevance Recent High-Signal Papers

Known Limitations

Only 3.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (9)
Simulation Env (1)

Top Metrics

Relevance (2)
Accuracy (1)
Agreement (1)
Latency (1)

Top Benchmarks

HellaSwag (1)
Vdr Bench (1)

Quality Controls

Adjudication (1)

Papers In This Archive Slice

Moving On, Even When You're Broken: Fail-Active Trajectory Generation via Diffusion Policies Conditioned on Embodiment and Task
Gilberto G. Briscoe-Martinez, Yaashia Gautam, Rahul Shetty, Anuj Pasricha, Marco M. Nicotra · Feb 2, 2026 · Citations: 0
WAXAL: A Large-Scale Multilingual African Language Speech Corpus
Abdoulaye Diack, Perry Nelson, Kwaku Agbesi, Angela Nakalembe, MohamedElfatih MohamedKhair · Feb 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
From Sycophancy to Sensemaking: Premise Governance for Human-AI Decision Making
Raunak Jain · Feb 2, 2026 · Citations: 0

We argue reliable human-AI partnership requires a shift from answer generation to collaborative premise governance over a knowledge substrate, negotiating only what is decision-critical.
Proof-RM: A Scalable and Generalizable Reward Model for Math Proof
Haotong Yang, Zitong Wang, Shijia Kang, Siqi Yang, Wenkai Yu · Feb 2, 2026 · Citations: 0

In this work, we design a *scalable* data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality ``**question-proof-check**'' triplet data.
VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations
Fatemeh Zargarbashi, Dhruv Agrawal, Jakob Buhmann, Martin Guay, Stelian Coros · Feb 2, 2026 · Citations: 0

Human motion data is inherently rich and complex, containing both semantic content and subtle stylistic features that are challenging to model.
Language Steering for Multilingual In-Context Learning
Neeraja Kirtane, Kuan-Hao Huang · Feb 2, 2026 · Citations: 0

Demonstrations

We propose language vectors, computed as the mean activation difference between parallel source and target language examples at a particular layer, and added as an offset to hidden states at inference time to shift the model's internal…
Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?
Alex Argese, Pasquale Lisena, Raphaël Troncy · Feb 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen · Feb 2, 2026 · Citations: 0

Expert Verification Web Browsing

However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations.
DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations
Minghao Li, Ruihang Wang, Rui Tan, Yonggang Wen · Feb 2, 2026 · Citations: 0

However, manually designing piecewise deep reinforcement learning (DRL) agents cannot keep pace with frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC.
Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts
Wenhao Li, Daohai Yu, Gen Luo, Yuxin Zhang, Fei Chao · Feb 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
LEC-KG: An LLM-Embedding Collaborative Framework for Domain-Specific Knowledge Graph Construction -- A Case Study on SDGs
Yikai Zeng, Yingchao Piao, Changhua Pei, Jianhui Li · Feb 2, 2026 · Citations: 0
CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models
Weining Fu, Kai Shu, Kui Xu, Qiangfeng Cliff Zhang · Feb 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Rethinking the Role of Entropy in Optimizing Tool-Use Behaviors for Large Language Model Agents
Zeping Li, Hongru Wang, Yiwen Zhao, Guanhua Chen, Yixia Li · Feb 2, 2026 · Citations: 0
Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
Zhanghao Hu, Qinglin Zhu, Hanqi Yan, Yulan He, Lin Gui · Feb 2, 2026 · Citations: 0

Agent memory systems often adopt the standard Retrieval-Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting.
Towards Exploratory and Focused Manipulation with Bimanual Active Perception: A New Problem, Benchmark and Strategy
Yuxin He, Ruihao Zhang, Tianao Shen, Cheng Liu, Qiang Nie · Feb 2, 2026 · Citations: 0
Behavioral Consistency Validation for LLM Agents: An Analysis of Trading-Style Switching through Stock-Market Simulation
Zeping Li, Guancheng Wan, Keyang Chen, Yu Chen, Yiwen Zhao · Feb 2, 2026 · Citations: 0
Read As Human: Compressing Context via Parallelizable Close Reading and Skimming
Jiwei Tang, Shilei Liu, Zhicheng Zhang, Qingsong Lv, Runsong Zhao · Feb 2, 2026 · Citations: 0
AXE: Low-Cost Cross-Domain Web Structured Information Extraction
Abdelrahman Mansour, Khaled W. Alshaer, Moataz Elsaban · Feb 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Controlling Exploration-Exploitation in GFlowNets via Markov Chain Perspectives
Lin Chen, Samuel Drapeau, Fanghao Shao, Xuekai Zhu, Bo Xue · Feb 2, 2026 · Citations: 0

Across various benchmarks, including Set, Bit Sequence, and Molecule Generation, α-GFN objectives consistently outperform previous GFlowNet objectives, achieving up to a 10 \times increase in the number of discovered modes.
COMI: Coarse-to-fine Context Compression via Marginal Information Gain
Jiwei Tang, Shilei Liu, Zhicheng Zhang, Yujin Yuan, Libin Zheng · Feb 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Mechanistic Indicators of Steering Effectiveness in Large Language Models
Mehdi Jafari, Hao Xue, Flora Salim · Feb 2, 2026 · Citations: 0

Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges.
Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models
Wenhui Tan, Fiorenzo Parascandolo, Enver Sangineto, Jianzhong Ju, Zhenbo Luo · Feb 2, 2026 · Citations: 0

Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models.
Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards
Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She · Feb 2, 2026 · Citations: 0

Empirical results show that VIP consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies in multiple benchmarks.
Argument Rarity-based Originality Assessment for AI-Assisted Writing
Keito Inoshita, Michiaki Omura, Tsukasa Yamanaka, Go Maeda, Kentaro Tsuji · Feb 2, 2026 · Citations: 0

Experiments using 1,375 human essays and 1,000 AI-generated essays on two argumentative topics revealed three key findings.
InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs
Lv Tang, Tianyi Zheng, Bo Li, Xingyu Li · Feb 2, 2026 · Citations: 0
Making Bias Non-Predictive: Training Robust LLM Reasoning via Reinforcement Learning
Qian Wang, Xuandong Zhao, Zirui Zhang, Zhanzhi Lou, Nuo Chen · Feb 2, 2026 · Citations: 0

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote