HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-12

Updated from current HFEPX corpus (Apr 12, 2026). 25 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 25 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequently cited benchmark: RooflineBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 12, 2026.

Papers: 25 Last published: Feb 12, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

25 / 25 papers are not low-signal flagged.

Benchmark Anchors

8.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

32.0%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

16% of papers report explicit human-feedback signals, led by demonstration data.
automatic metrics appears in 32% of papers in this hub.
RooflineBench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.
Stratify by benchmark (RooflineBench vs Zoombench) before comparing methods.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Feb 12, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Latency
Query-focused and Memory-aware Reranker for Long Context Processing
Feb 12, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy, Relevance
propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
Feb 12, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Agreement, Relevance
On-Policy Context Distillation for Language Models
Feb 12, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Accuracy
Think like a Scientist: Physics-guided LLM Agent for Equation Discovery
Feb 12, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Accuracy
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Feb 12, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Accuracy, Error rate

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception Feb 12, 2026	Automatic Metrics	Zoombench	Latency	Not reported
Query-focused and Memory-aware Reranker for Long Context Processing Feb 12, 2026	Automatic Metrics	Not reported	Accuracy, Relevance	Not reported
propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale Feb 12, 2026	Automatic Metrics	Not reported	Agreement, Relevance	Not reported
On-Policy Context Distillation for Language Models Feb 12, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Think like a Scientist: Physics-guided LLM Agent for Equation Discovery Feb 12, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most Feb 12, 2026	Automatic Metrics	Not reported	Accuracy, Error rate	Not reported
Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models Feb 12, 2026	Automatic Metrics	Not reported	F1, F1 macro	Not reported
SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving Feb 12, 2026	Automatic Metrics	Not reported	Cost	Not reported
TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents Feb 12, 2026	Not reported	WebShop	Not reported	Not reported
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation Feb 12, 2026	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (16% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (8% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (32% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (8% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (12% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8% coverage).
Annotation unit is under-specified (12% coverage).

Suggested Next Analyses

Stratify by benchmark (RooflineBench vs Zoombench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Benchmark Slice: RooflineBench Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (8)

Top Metrics

Accuracy (5)
Cost (2)
Latency (2)
Mae (1)

Top Benchmarks

RooflineBench (1)
Zoombench (1)

Quality Controls

Papers In This Archive Slice

propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries · Feb 12, 2026 · Citations: 0

We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose,…
Evolving Beyond Snapshots: Harmonizing Structure and Sequence via Entity State Tuning for Temporal Knowledge Graph Forecasting
Siyuan Li, Yunjia Wu, Yiyong Xiao, Pingyang Huang, Peize Li · Feb 12, 2026 · Citations: 0

Long Horizon

Experiments on multiple benchmarks show that EST consistently improves diverse backbones and achieves state-of-the-art performance, highlighting the importance of state persistence for long-horizon TKG forecasting.
On-Policy Context Distillation for Language Models
Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, Furu Wei · Feb 12, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Think like a Scientist: Physics-guided LLM Agent for Equation Discovery
Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu · Feb 12, 2026 · Citations: 0

Long Horizon

We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process.
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou · Feb 12, 2026 · Citations: 0

Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments.
Energy-Aware Spike Budgeting for Continual Learning in Spiking Neural Networks for Neuromorphic Vision
Anika Tabassum Meem, Muntasir Hossain Nadid, Md Zesun Ahmed Mia · Feb 12, 2026 · Citations: 0
Query-focused and Memory-aware Reranker for Long Context Processing
Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin · Feb 12, 2026 · Citations: 0

Rubric Rating

It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage.
GPT-4o Lacks Core Features of Theory of Mind
John Muchovej, Amanda Royka, Shane Lee, Julian Jara-Ettinger · Feb 12, 2026 · Citations: 0

Research into this question has focused on evaluating LLMs against benchmarks and found success across a range of social tasks.
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang · Feb 12, 2026 · Citations: 0

Expert Verification

Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term…
Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty
Zewei Yu, Lirong Gao, Yuke Zhu, Bo Zheng, Junbo Zhao · Feb 12, 2026 · Citations: 0
Tiny Recursive Reasoning with Mamba-2 Attention Hybrid
Wenlong Wang, Fergal Reid · Feb 12, 2026 · Citations: 0
Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models
Yuzhe Shang, Pengzhi Gao, Wei Liu, Jian Luan, Jinsong Su · Feb 12, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance
Yunchong Huang, Gianni Barlacchi, Sandro Pezzelle · Feb 12, 2026 · Citations: 0

Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved.
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai · Feb 12, 2026 · Citations: 0

Tool Use

To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
Predicting LLM Output Length via Entropy-Guided Representations
Huanyi Xie, Yubin Chen, Liangyu Wang, Lijie Hu, Di Wang · Feb 12, 2026 · Citations: 0
TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Heiko Ludwig, Holger Boche · Feb 12, 2026 · Citations: 0

Long Horizon

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks.
MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling
MiniCPM Team, Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li · Feb 12, 2026 · Citations: 0
Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models
Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov · Feb 12, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving
Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Hogun Park, Il Yong Chun · Feb 12, 2026 · Citations: 0

For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions.
Variation-aware Flexible 3D Gaussian Editing
Hao Qin, Yukai Sun, Meng Wang, Ming Kong, Mengxu Lu · Feb 12, 2026 · Citations: 0
Native Reasoning Models: Training Language Models to Reason on Unverifiable Data
Yuanfu Wang, Zhixuan Liu, Xiangtian Li, Chaochao Lu, Chao Yang · Feb 12, 2026 · Citations: 0

Demonstrations

The prevailing paradigm for training large reasoning models--combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR)--is fundamentally constrained by its reliance on high-quality, human-annotated…
OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li · Feb 12, 2026 · Citations: 0

Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset.
RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis
Zhen Bi, Xueshu Chen, Luoyang Sun, Yuhang Yao, Qing Shen · Feb 12, 2026 · Citations: 0
Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis · Feb 12, 2026 · Citations: 0

Red Team

On an abliterated LLaMA-3.1-8B model, selectively bypassing high-susceptibility layers blocks 78% of jailbreak attempts while preserving benign behavior on 94% of benign prompts.
When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration
Jayadev Billa · Feb 12, 2026 · Citations: 0

We introduce ALME (Audio-LLM Modality Evaluation), a dataset of 57,602 controlled audio-text conflict stimuli across eight languages, together with Text Dominance Ratio (TDR), which measures how often a model follows conflicting text when…

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote