
HFEPX Daily Archive: 2026-02-09


Updated from the current HFEPX corpus (Apr 12, 2026). 13 papers are grouped on this daily page. Most common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Most common annotation unit: Trajectory. Frequently cited benchmark: LongBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Feb 9, 2026.

Papers: 13 · Last published: Feb 9, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium.

  • High-Signal Coverage: 100.0% (13 of 13 papers carry no low-signal flag).
  • Benchmark Anchors: 23.1% (papers with benchmark/dataset mentions in the extraction output).
  • Metric Anchors: 38.5% (papers with reported-metric mentions in the extraction output).

  • 0 papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as an early signal only; benchmark/metric anchoring is too limited for rigorous period-over-period claims.
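One way to apply that anchor-first triage rule programmatically. This is a minimal sketch: the dict layout (title / benchmarks / metrics fields) is an assumption for illustration, not the actual HFEPX extraction format, though the two example records come from this slice's protocol matrix.

```python
# Keep only papers that carry BOTH a benchmark anchor and a metric
# anchor, per the triage guidance above. The record layout is assumed.

papers = [
    {"title": "Document Reconstruction Unlocks Scalable Long-Context RLVR",
     "benchmarks": ["LongBench"], "metrics": ["Coherence"]},
    {"title": "PBLean: Pseudo-Boolean Proof Certificates for Lean 4",
     "benchmarks": [], "metrics": []},  # nothing reported -> filtered out
]

fully_anchored = [p for p in papers if p["benchmarks"] and p["metrics"]]

for paper in fully_anchored:
    print(paper["title"])  # prints only the fully anchored paper
```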


Why This Time Slice Matters

  • 23.1% of papers report explicit human-feedback signals, led by critique/edit feedback.
  • Automatic-metrics evaluation appears in 38.5% of papers in this hub.
  • LongBench is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Raters are mostly domain experts, and annotation is commonly done at the trajectory level; use this to scope replication staffing.
  • Stratify by benchmark (LongBench vs TREC) before comparing methods, as shown in the sketch after this list.
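A minimal sketch of that stratification step, assuming a flat list of per-method results. All method names and score values below are placeholders for illustration, not numbers reported by any paper in this slice.

```python
# Group results by benchmark so methods are only compared within a
# stratum: LongBench scores never get pooled with TREC scores.
# Methods and scores are placeholder values.
from collections import defaultdict

results = [
    {"method": "A", "benchmark": "LongBench", "score": 0.61},
    {"method": "B", "benchmark": "LongBench", "score": 0.58},
    {"method": "A", "benchmark": "TREC", "score": 0.43},
    {"method": "B", "benchmark": "TREC", "score": 0.47},
]

strata = defaultdict(list)
for row in results:
    strata[row["benchmark"]].append(row)

for benchmark, rows in sorted(strata.items()):
    best = max(rows, key=lambda r: r["score"])
    print(f"{benchmark}: best within stratum = {best['method']} ({best['score']:.2f})")
```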

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice. All papers in the matrix are dated Feb 9, 2026.

| Paper | Eval Modes | Benchmarks | Metrics | Quality Controls |
|---|---|---|---|---|
| Document Reconstruction Unlocks Scalable Long-Context RLVR | Automatic Metrics | LongBench | Coherence | Not reported |
| Automatic In-Domain Exemplar Construction and LLM-Based Refinement of Multi-LLM Expansions for Query Expansion | Not reported | TREC | Not reported | Not reported |
| ViGoEmotions: A Benchmark Dataset For Fine-grained Emotion Detection on Vietnamese Texts | Automatic Metrics | Not reported | F1, F1 macro | Not reported |
| Language Modeling and Understanding Through Paraphrase Generation and Detection | Automatic Metrics | Not reported | Accuracy | Not reported |
| Pretraining with Token-Level Adaptive Latent Chain-of-Thought | Automatic Metrics | Not reported | Accuracy, Perplexity | Not reported |
| Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI | Automatic Metrics | Not reported | Not reported | Not reported |
| UI-Venus-1.5 Technical Report | Not reported | APPS, Venusbench | Not reported | Not reported |
| Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis | Not reported | Not reported | Jailbreak success rate | Not reported |
| PBLean: Pseudo-Boolean Proof Certificates for Lean 4 | Not reported | Not reported | Not reported | Not reported |
| Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure | Not reported | Not reported | Not reported | Not reported |
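To reproduce an ordering in the spirit of the "Start Here" ranking above, here is a minimal sketch that scores matrix rows by protocol completeness (one point per reported column). The equal-weight scoring rule is an assumption; the page does not document how protocol completeness and evidence density are actually combined.

```python
# Score protocol-matrix rows by completeness: one point for each of
# the four protocol columns that is explicitly reported (not None).
# The equal-weight rule is an assumed stand-in for the page's ranking.

FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

matrix = [
    {"title": "Document Reconstruction Unlocks Scalable Long-Context RLVR",
     "eval_modes": "Automatic Metrics", "benchmarks": "LongBench",
     "metrics": "Coherence", "quality_controls": None},
    {"title": "UI-Venus-1.5 Technical Report",
     "eval_modes": None, "benchmarks": "APPS, Venusbench",
     "metrics": None, "quality_controls": None},
    {"title": "PBLean: Pseudo-Boolean Proof Certificates for Lean 4",
     "eval_modes": None, "benchmarks": None,
     "metrics": None, "quality_controls": None},
]

def completeness(row):
    """Number of protocol fields the paper explicitly reports."""
    return sum(1 for field in FIELDS if row[field] is not None)

for row in sorted(matrix, key=completeness, reverse=True):
    print(f"{completeness(row)}/4  {row['title']}")
```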
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback. Coverage is a replication risk (23.1% vs 45% target).
  • Gap: Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Gap: Papers naming benchmarks/datasets. Coverage is a replication risk (15.4% vs 35% target).
  • Moderate: Papers naming evaluation metrics. Coverage is usable but incomplete (23.1% vs 35% target).
  • Gap: Papers with known rater population. Coverage is a replication risk (15.4% vs 35% target).
  • Moderate: Papers with known annotation unit. Coverage is usable but incomplete (23.1% vs 35% target).
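This checklist can be regenerated mechanically. A minimal sketch, with coverage and target figures copied from the bullets above; note that the 0.6 coverage-to-target ratio separating "Gap" from "Moderate" is a guess that happens to reproduce this page's labels, not a documented rule.

```python
# Flag replication-risk gaps by comparing observed coverage against
# target coverage. Figures come from the checklist above; the 0.6
# Gap/Moderate cut-off is an assumption, not a documented rule.

CHECKS = {
    "explicit human feedback": (23.1, 45.0),
    "quality controls reported": (0.0, 30.0),
    "benchmarks/datasets named": (15.4, 35.0),
    "evaluation metrics named": (23.1, 35.0),
    "rater population known": (15.4, 35.0),
    "annotation unit known": (23.1, 35.0),
}

for name, (observed, target) in CHECKS.items():
    label = "Moderate" if observed >= 0.6 * target else "Gap"
    print(f"{label:8s} {name}: {observed:.1f}% vs {target:.0f}% target")
```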

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (15.4% coverage).
  • Annotation unit is under-specified (23.1% coverage).

Suggested Next Analyses

  • Stratify by benchmark (LongBench vs TREC) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and coherence; a rank-agreement sketch follows this list.
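A minimal sketch of that metric-sensitivity check: report both metrics side by side and test whether they agree on the method ranking. Both methods and all score values are placeholders, not results from this slice.

```python
# Report accuracy and coherence side by side and check whether the
# two metrics agree on the method ranking. If they disagree, any
# single-metric comparison is metric-sensitive. Values are placeholders.

scores = {
    "method_A": {"accuracy": 0.74, "coherence": 0.61},
    "method_B": {"accuracy": 0.69, "coherence": 0.66},
}

rank_by_accuracy = sorted(scores, key=lambda m: scores[m]["accuracy"], reverse=True)
rank_by_coherence = sorted(scores, key=lambda m: scores[m]["coherence"], reverse=True)

print("rank by accuracy: ", rank_by_accuracy)
print("rank by coherence:", rank_by_coherence)
print("metric-sensitive ranking:", rank_by_accuracy != rank_by_coherence)
```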

Known Limitations
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (5)

Top Metrics

  • Accuracy (1)
  • Coherence (1)
  • Cost (1)
  • Perplexity (1)

Top Benchmarks

  • LongBench (1)
  • TREC (1)

Quality Controls

  • None reported (0 papers in this slice).
