HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-13

Updated from current HFEPX corpus (Apr 12, 2026). 17 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 17 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 13, 2026.

Papers: 17 Last published: Feb 13, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

17 / 17 papers are not low-signal flagged.

Benchmark Anchors

5.9%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

47.1%

Papers with reported metric mentions in extraction output.

2 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

17.6% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 47.1% of papers in this hub.
BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (11.8% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Stratify by benchmark (BrowseComp vs LMSYS Chatbot Arena) before comparing methods.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Feb 13, 2026 · Citations: 0 · Score: 8.0

Eval: Automatic Metrics · Metrics: Error rate
Learning Ordinal Probabilistic Reward from Preferences
Feb 13, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy
PMG: Parameterized Motion Generator for Human-like Locomotion Control
Feb 13, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Calibration
OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report
Feb 13, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Precision
Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts
Feb 13, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Accuracy
Buy versus Build an LLM: A Decision Framework for Governments
Feb 13, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Cost

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
SCOPE: Selective Conformal Optimized Pairwise LLM Judging Feb 13, 2026	Automatic Metrics	MT Bench, LMSYS Chatbot Arena	Error rate	Calibration
Learning Ordinal Probabilistic Reward from Preferences Feb 13, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
PMG: Parameterized Motion Generator for Human-like Locomotion Control Feb 13, 2026	Automatic Metrics	Not reported	Calibration	Calibration
OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report Feb 13, 2026	Automatic Metrics	Not reported	Precision	Not reported
Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts Feb 13, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Buy versus Build an LLM: A Decision Framework for Governments Feb 13, 2026	Automatic Metrics	Not reported	Cost	Not reported
BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents Feb 13, 2026	Automatic Metrics, Simulation Env	Not reported	Accuracy	Not reported
Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats Feb 13, 2026	Automatic Metrics	Not reported	Accuracy, Precision	Not reported
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs Feb 13, 2026	Not reported	Not reported	Not reported	Not reported
Semantic Chunking and the Entropy of Natural Language Feb 13, 2026	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (17.6% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (11.8% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (17.6% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (23.5% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (11.8% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (29.4% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 11.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (11.8% coverage).
Benchmark coverage is thin (17.6% of papers mention benchmarks/datasets).

Suggested Next Analyses

Stratify by benchmark (BrowseComp vs LMSYS Chatbot Arena) before comparing methods.
Track metric sensitivity by reporting both accuracy and calibration.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

Benchmark Slice: BrowseComp Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 11.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (11.8% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (8)
Simulation Env (1)

Top Metrics

Accuracy (2)
Calibration (1)
Error rate (1)

Top Benchmarks

BrowseComp (1)
LMSYS Chatbot Arena (1)
MT Bench (1)
Rewardbench (1)

Quality Controls

Calibration (2)

Papers In This Archive Slice

Semantic Chunking and the Entropy of Natural Language
Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks · Feb 13, 2026 · Citations: 0

The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached.
CoPE-VideoLM: Leveraging Codec Primitives For Efficient Video Language Modeling
Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni · Feb 13, 2026 · Citations: 0

Moreover, by varying the keyframe and codec primitive densities we maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal and motion reasoning, long-form understanding, and…
Quantization-Robust LLM Unlearning via Low-Rank Adaptation
João Vitor Boer Abitante, Joana Meneguzzo Pasquali, Luan Fonseca Garcia, Ewerton de Oliveira, Thomas da Silva Paula · Feb 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report
Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen · Feb 13, 2026 · Citations: 0

We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks.
SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0

Pairwise Preference

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts
Kais Allkivi · Feb 13, 2026 · Citations: 0

Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets.
Consistency of Large Reasoning Models Under Multi-Turn Attacks
Yubo Li, Ramayya Krishnan, Rema Padman · Feb 13, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Buy versus Build an LLM: A Decision Framework for Governments
Jiahao Lu, Ziwei Xu, William Tjhi, Junnan Li, Antoine Bosselut · Feb 13, 2026 · Citations: 0

This paper provides a strategic framework for making this decision by evaluating these options across dimensions including sovereignty, safety, cost, resource capability, cultural fit, and sustainability.
BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan · Feb 13, 2026 · Citations: 0

Web Browsing

Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.
Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks
Elena Alvarez-Mellado, Julio Gonzalo · Feb 13, 2026 · Citations: 0

We propose an evaluation methodology for sequence labeling tasks grounded on error analysis that provides both quantitative and qualitative information on where systems must be improved and predicts how models will perform on a different…
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian · Feb 13, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Long Horizon

MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities.
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen · Feb 13, 2026 · Citations: 0
Learning Ordinal Probabilistic Reward from Preferences
Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo · Feb 13, 2026 · Citations: 0

Pairwise Preference

Reward models are crucial for aligning large language models (LLMs) with human values and intentions.
PMG: Parameterized Motion Generator for Human-like Locomotion Control
Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu · Feb 13, 2026 · Citations: 0

Long Horizon

Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.
Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats
Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin · Feb 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Discovering Semantic Latent Structures in Psychological Scales: A Response-Free Pathway to Efficient Simplification
Bo Wang, Yuxuan Zhang, Yueqin Hu, Hanchao Hou, Kaiping Peng · Feb 13, 2026 · Citations: 0

We benchmarked the framework across DASS, IPIP, and EPOCH, evaluating structural recovery, internal consistency, factor congruence, correlation preservation, and reduction efficiency.
To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models
Haoqing Wang, Xiang Long, Ziheng Li, Yilong Xu, Tingguang Li · Feb 13, 2026 · Citations: 0

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now