HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-15

Updated from current HFEPX corpus (Apr 12, 2026). 42 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 42 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Adjudication. Frequently cited benchmark: Ad-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 15, 2026.

Papers: 42 Last published: Feb 15, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

42 / 42 papers are not low-signal flagged.

Benchmark Anchors

21.4%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

47.6%

Papers with reported metric mentions in extraction output.

2 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

19% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 45.2% of papers in this hub.
Ad-Bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is adjudication (2.4% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Feb 15, 2026 · Citations: 0 · Score: 8.0

Eval: Automatic Metrics · Metrics: Accuracy
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Feb 15, 2026 · Citations: 0 · Score: 7.0

Eval: Simulation Env · Metrics: Pass@1, Pass@3
MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM
Feb 15, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Cost
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective
Feb 15, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Conciseness
Neuromem: A Granular Decomposition of the Streaming Lifecycle in External Memory for LLMs
Feb 15, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, F1
LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts
Feb 15, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Bleu

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam Feb 15, 2026	Automatic Metrics	HLE	Accuracy	Adjudication
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents Feb 15, 2026	Simulation Env	Ad Bench	Pass@1, Pass@3	Not reported
MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM Feb 15, 2026	Automatic Metrics	LongBench, Needle In A Haystack	Accuracy, Cost	Not reported
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective Feb 15, 2026	Automatic Metrics	ARC Challenge	Accuracy, Conciseness	Not reported
Neuromem: A Granular Decomposition of the Streaming Lifecycle in External Memory for LLMs Feb 15, 2026	Automatic Metrics	Insertion And Retrieval, Longmemeval	Accuracy, F1	Not reported
LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts Feb 15, 2026	Automatic Metrics	Not reported	Bleu	Not reported
We can still parse using syntactic rules Feb 15, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents Feb 15, 2026	Automatic Metrics	Not reported	Recall, Cost	Not reported
Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports Feb 15, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Knowing When Not to Answer: Abstention-Aware Scientific Reasoning Feb 15, 2026	Automatic Metrics	Not reported	Accuracy	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (19% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (4.8% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (9.5% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (16.7% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (14.3% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (23.8% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 4.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (14.3% coverage).
Annotation unit is under-specified (23.8% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (Ad-Bench vs ARC-Challenge) before comparing methods.
Track metric sensitivity by reporting both accuracy and bleu.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: Ad-Bench Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 4.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (14.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (19)
Llm As Judge (2)
Simulation Env (2)

Top Metrics

Accuracy (4)
Bleu (2)
Conciseness (1)
Cost (1)

Top Benchmarks

Ad Bench (1)
ARC Challenge (1)
HLE (1)
OSWorld (1)

Quality Controls

Adjudication (1)
Calibration (1)

Papers In This Archive Slice

Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
Ming Li, Xirui Li, Tianyi Zhou · Feb 15, 2026 · Citations: 0

We present the first large-scale systemic diagnosis of this AI agent society.
FMMD: A multimodal open peer review dataset based on F1000Research
Zhenzhen Zhuang, Yuqing Fu, Jing Zhu, Zhangping Zhou, Jialiang Lin · Feb 15, 2026 · Citations: 0

Automated scholarly paper review (ASPR) has entered the coexistence phase with traditional peer review, where artificial intelligence (AI) systems are increasingly incorporated into real-world manuscript evaluation.
MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents
Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir · Feb 15, 2026 · Citations: 0

Tool Use

The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enable third-party servers.
Whom to Query for What: Adaptive Group Elicitation via Multi-Turn LLM Interactions
Ruomeng Ding, Tianwei Gao, Thomas P. Zollo, Eitan Bachmat, Richard Zemel · Feb 15, 2026 · Citations: 0

To address this gap, we study adaptive group elicitation, a multi-round setting where an agent adaptively selects both questions and respondents under explicit query and participation budgets.
STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts
Zachary Bamberger, Till R. Saenger, Gilad Morad, Ofra Amir, Brandon M. Stewart · Feb 15, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Detecting LLM Hallucinations via Embedding Cluster Geometry: A Three-Type Taxonomy with Measurable Signatures
Matic Korun · Feb 15, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0

Expert Verification Long Horizon

While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
We can still parse using syntactic rules
Ghaly Hussein · Feb 15, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents
Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang · Feb 15, 2026 · Citations: 0

Tool Use

To address these challenges, we propose REDSearcher, a unified framework that codesigns complex task synthesis, midtraining, and posttraining for scalable searchagent optimization.
The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents
Ziyang Ma, Ruiyang Xu, Yinghao Ma, Chao-Han Huck Yang, Bohan Li · Feb 15, 2026 · Citations: 0

Rubric Rating

Featured Single Model and Agent tracks, the competition attracting 156 teams from 18 countries and regions.
Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports
Dragan Stoll, Brian E. Perron, Zia Qi, Selina Steinmann, Nicole F. Eicher · Feb 15, 2026 · Citations: 0

The performance of RLMs with different parameter sizes (255B, 32B, 4B) was compared against human validated data.
MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM
Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park · Feb 15, 2026 · Citations: 0

Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently outperforming AR-oriented sparse…
Knowing When Not to Answer: Abstention-Aware Scientific Reasoning
Samir Abdaljalil, Erchin Serpedin, Hasan Kurban · Feb 15, 2026 · Citations: 0

We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings.
GPT-5 vs Other LLMs in Long Short-Context Performance
Nima Esmi, Maryam Nezhad-Moghaddam, Fatemeh Borhani, Asadollah Shahbahrami, Amin Daemdoost · Feb 15, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model
Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li · Feb 15, 2026 · Citations: 0
Investigation for Relative Voice Impression Estimation
Kenichi Fujita, Yusuke Ijima · Feb 15, 2026 · Citations: 0

Pairwise Preference

The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., ``Dark--Bright'').
Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling
Yiran Guo, Zhongjian Qiao, Yingqi Xie, Jie Liu, Dan Ye · Feb 15, 2026 · Citations: 0

Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines.
Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026 · Citations: 0

16.1\% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2\% vs.
A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing
Naeimeh Nourmohammadi, Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman · Feb 15, 2026 · Citations: 0

Multi Agent

We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability.
ROAST: Rollout-based On-distribution Activation Steering Technique
Xuanbo Su, Hao Luo, Yingfang Zhang, Lijun Zhang · Feb 15, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Algebraic Quantum Intelligence: A New Framework for Reproducible Machine Creativity
Kazuo Yano, Jonghyeok Lee, Tae Ishitomi, Hironobu Kawaguchi, Akira Koyama · Feb 15, 2026 · Citations: 0

We evaluate the resulting system on creative reasoning benchmarks spanning ten domains under an LLM-as-a-judge protocol.
Character-aware Transformers Learn an Irregular Morphological Pattern Yet None Generalize Like Humans
Akhilesh Kakolu Ramarao, Kevin Tang, Dinah Baer-Henney · Feb 15, 2026 · Citations: 0

Recent work has shown that encoder-decoder models can acquire irregular patterns, but evidence that they generalize these patterns like humans is mixed.
CCiV: A Benchmark for Structure, Rhythm and Quality in LLM-Generated Chinese \textit{Ci} Poetry
Shangqing Zhao, Yupei Ren, Yuhao Zhou, Xiaopeng Bai, Man Lan · Feb 15, 2026 · Citations: 0

To systematically evaluate and advance this capability, we introduce Chinese Cipai Variants (CCiV), a benchmark designed to assess LLM-generated Ci poetry across these three dimensions: structure, rhythm, and quality.
Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality
Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona · Feb 15, 2026 · Citations: 0

To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search.
GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler
Minghan Wang, Ye Bai, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari · Feb 15, 2026 · Citations: 0

Experiments across multiple benchmarks and two latent reasoning architectures show that GTS yields more reliable inference-time scaling than heuristic baselines, suggesting that effective latent ITS requires better-controlled and…
Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework
Grzegorz Statkiewicz, Alicja Dobrzeniecka, Karolina Seweryn, Aleksandra Krasnodębska, Karolina Piosek · Feb 15, 2026 · Citations: 0

Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along…
Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric
Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao · Feb 15, 2026 · Citations: 0

Pairwise PreferenceRubric Rating

To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which…
From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset
Jandad Jahani, Mursal Dawodi, Jawid Ahmad Baktash · Feb 15, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts
Yang Liu, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li · Feb 15, 2026 · Citations: 0

Expert Verification

By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art…
LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation
Jizheng Chen, Weiming Zhang, Xinyi Dai, Weiwen Liu, Kounianhua Du · Feb 15, 2026 · Citations: 0

Pairwise Preference

LogitsCoder iteratively generates and refines reasoning steps by first steering token selection toward statistically preferred patterns via Logits Preference Decoding, then selecting and aggregating diverse reasoning paths using Logits Rank…
Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness
Pietro Bernardelle, Stefano Civelli, Kevin Roitero, Gianluca Demartini · Feb 15, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
BitDance: Scaling Autoregressive Generative Models with Binary Tokens
Yuang Ai, Jiaming Han, Shaobin Zhuang, Weijia Mao, Xuefeng Hu · Feb 15, 2026 · Citations: 0
Geometry-Preserving Aggregation for Mixture-of-Experts Embedding Models
Sajjad Kachuee, Mohammad Sharifkhani · Feb 15, 2026 · Citations: 0

Experiments on selected tasks from the Massive Text Embedding Benchmark (MTEB), including semantic similarity, clustering, and duplicate question detection, demonstrate consistent performance improvements with identical training cost and…
GRRM: Group Relative Reward Modeling for Machine Translation
Sen Yang, Shanbo Cheng, Lu Xu, Jianbing Zhang, Shujian Huang · Feb 15, 2026 · Citations: 0

Empirical evaluations confirm that GRRM achieves competitive ranking accuracy among all baselines.
Named Entity Recognition for Payment Data Using NLP
Srikumar Nayak · Feb 15, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective
Ali Zahedzadeh, Behnam Bahrak · Feb 15, 2026 · Citations: 0

Long Horizon

Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers.To operationalize this view, we introduce an evaluation…
Chain-of-Thought Reasoning with Large Language Models for Clinical Alzheimer's Disease Assessment and Diagnosis
Tongze Zhang, Jun-En Ding, Melik Ozolcer, Fang-Ming Hung, Albert Chih-Chieh Yang · Feb 15, 2026 · Citations: 0

Traditional diagnosis still relies heavily on medical imaging and clinical assessment by physicians, which is often time-consuming and resource-intensive in terms of both human expertise and healthcare resources.
Neuromem: A Granular Decomposition of the Streaming Lifecycle in External Memory for LLMs
Ruicheng Zhang, Xinyi Li, Tianyi Xu, Shuhao Zhang, Xiaofei Liao · Feb 15, 2026 · Citations: 0

We present Neuromem, a scalable testbed that benchmarks External Memory Modules under an interleaved insertion-and-retrieval protocol and decomposes its lifecycle into five dimensions including memory data structure, normalization strategy,…
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0

Expert VerificationCritique Edit

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars
Shuoyuan Wang, Yiran Wang, Hongxin Wei · Feb 15, 2026 · Citations: 0

We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery.
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0

Long Horizon

The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning
Zhimin Zhao · Feb 15, 2026 · Citations: 0

Pairwise Preference

We propose a five-level hierarchy of learnability based on information structure and argue that the ceiling on ML progress depends less on model size than on whether a task is learnable at all.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote