HFEPX Archive Slice

HFEPX Monthly Archive: 2026-02

Updated from current HFEPX corpus (Apr 12, 2026). 1089 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 1089 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: SWE-bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 28, 2026.

Papers: 1,089 Last published: Feb 28, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

Analysis blocks are computed from the loaded sample (60 of 1,089 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

5.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

18.3%

Papers with reported metric mentions in extraction output.

2 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

12.1% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 33.1% of papers in this hub.
SWE-bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (1.8% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Feb 27, 2026 · Citations: 0 · Score: 7.0

Eval: Llm As Judge · Metrics: Success rate, Jailbreak success rate
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
Feb 27, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Feb 27, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
Confusion-Aware Rubric Optimization for LLM-based Automated Grading
Feb 28, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy, Precision
Transformers Remember First, Forget Last: Dual-Process Interference in LLMs
Feb 27, 2026 · Citations: 0 · Score: 5.0

Eval: Not reported · Metrics: Cost
SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?
Feb 28, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Success rate

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking Feb 27, 2026	Llm As Judge	AdvBench, Jbf Eval	Success rate, Jailbreak success rate	Not reported
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models Feb 27, 2026	Automatic Metrics	Not reported	Accuracy	Calibration
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science Feb 27, 2026	Automatic Metrics	Dare Bench	Accuracy	Not reported
Confusion-Aware Rubric Optimization for LLM-based Automated Grading Feb 28, 2026	Automatic Metrics	Not reported	Accuracy, Precision	Not reported
Transformers Remember First, Forget Last: Dual-Process Interference in LLMs Feb 27, 2026	Not reported	Consolidation Retrieval	Cost	Not reported
SkillCraft: Can LLM Agents Learn to Use Tools Skillfully? Feb 28, 2026	Automatic Metrics	Not reported	Success rate	Not reported
BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages Feb 28, 2026	Automatic Metrics	Not reported	F1	Not reported
When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation Feb 27, 2026	Llm As Judge, Automatic Metrics	Not reported	Precision, Bleu	Not reported
Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning Feb 27, 2026	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek Feb 27, 2026	Human Eval, Automatic Metrics	Not reported	Bleu, Rouge	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (12.1% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (3.6% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (6.7% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (18.6% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (9.3% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (10.7% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 3.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9.3% coverage).
Annotation unit is under-specified (10.7% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (SWE-bench vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: SWE-bench Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 3.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (360)
Simulation Env (45)
Llm As Judge (19)
Human Eval (16)

Top Metrics

Accuracy (97)
Cost (40)
Latency (19)
Precision (18)

Top Benchmarks

SWE Bench (6)
DROP (5)
BrowseComp (4)
SWE Bench Verified (4)

Quality Controls

Calibration (20)
Inter Annotator Agreement Reported (12)
Adjudication (8)
Gold Questions (2)

Papers In This Archive Slice

Learning Nested Named Entity Recognition from Flat Annotations
Igor Rozhkov, Natalia Loukachevitch · Feb 28, 2026 · Citations: 0
Constitutional Black-Box Monitoring for Scheming in LLM Agents
Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn · Feb 28, 2026 · Citations: 0
A Gauge Theory of Superposition: Toward a Sheaf-Theoretic Atlas of Neural Representations
Hossein Javidnia · Feb 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction
Ruihao Pan, Suhang Wang · Feb 28, 2026 · Citations: 0

Machine unlearning aims to remove the influence of specific training data from pre-trained models without retraining from scratch, and is increasingly important for large language models (LLMs) due to safety, privacy, and legal concerns.
Qwen3-Coder-Next Technical Report
Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng · Feb 28, 2026 · Citations: 0
LaSTR: Language-Driven Time-Series Segment Retrieval
Kota Dohi, Harsh Purohit, Tomoya Nishida, Takashi Endo, Yusuke Ohtsubo · Feb 28, 2026 · Citations: 0
RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models
Andrew Zhuoer Feng, Cunxiang Wang, Bosi Wen, Yidong Wang, Yu Luo · Feb 28, 2026 · Citations: 0
SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?
Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu · Feb 28, 2026 · Citations: 0

Long Horizon

Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting, and reusing higher-level tool…
DRIV-EX: Counterfactual Explanations for Driving LLMs
Amaia Cardiel, Eloi Zablocki, Elias Ramzi, Eric Gaussier · Feb 28, 2026 · Citations: 0
RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis
Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Bosi Wen, Yidong Wang · Feb 28, 2026 · Citations: 0
Polynomial Mixing for Efficient Self-supervised Speech Encoders
Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen · Feb 28, 2026 · Citations: 0
SSKG Hub: An Expert-Guided Platform for LLM-Empowered Sustainability Standards Knowledge Graphs
Chaoyue He, Xin Zhou, Xinjia Yu, Lei Zhang, Yan Zhang · Feb 28, 2026 · Citations: 0

Expert Verification

We present SSKG Hub (Sustainability Standards Knowledge Graph Hub), a research prototype and interactive web platform that transforms standards into auditable knowledge graphs (KGs) through an LLM-centered, expert-guided pipeline.
BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
Jason Lucas, Matt Murtagh-White, Adaku Uchendu, Ali Al-Lawati, Michiharu Yamashita · Feb 28, 2026 · Citations: 0

Multi Agent

We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) and LLM-generated…
TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces
Shu-Xun Yang, Cunxiang Wang, Haoke Zhang, Wenbo Yu, Lindong Wu · Feb 28, 2026 · Citations: 0
Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification
Anastasia Zhukova, Terry Ruas, Jan Philip Wahle, Bela Gipp · Feb 28, 2026 · Citations: 0

To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation…
QQ: A Toolkit for Language Identifiers and Metadata
Wessel Poelman, Yiyi Chen, Miryam de Lhoneux · Feb 28, 2026 · Citations: 0
From Literature to Hypotheses: An AI Co-Scientist System for Biomarker-Guided Drug Combination Hypothesis Generation
Raneen Younis, Suvinava Basak, Lukas Chavez, Zahra Ahmadi · Feb 28, 2026 · Citations: 0
LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
Yuchen Hou, Lin Zhao · Feb 28, 2026 · Citations: 0
Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research
Yubo Dong, Nianhao You, Yuxuan Hou, Zixun Sun, Yue Zhang · Feb 28, 2026 · Citations: 0

Long Horizon

To evaluate this capability, we curated a benchmark of 300 expert-written questions across diverse domains, each requiring up to 100+ retrieval steps and 1,000+ web pages to reconcile conflicting evidence.
Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
Jie Cao, Tianwei Lin, Zhenxuan Fan, Bo Yuan, Ziyuan Zhao · Feb 28, 2026 · Citations: 0
CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging
Jie Cao, Zhenxuan Fan, Zhuonan Wang, Tianwei Lin, Ziyuan Zhao · Feb 28, 2026 · Citations: 0
CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles
Swapnil Parekh · Feb 28, 2026 · Citations: 0
Optimizing In-Context Demonstrations for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek · Feb 28, 2026 · Citations: 0

Rubric RatingDemonstrations

GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
Confusion-Aware Rubric Optimization for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik · Feb 28, 2026 · Citations: 0

Rubric Rating

Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods.
RTLocating: Intent-aware RTL Localization for Hardware Design Iteration
Changwen Xing, Yanfeng Lu, Lei Qi, Chenxu Niu, Jie Li · Feb 28, 2026 · Citations: 0
A Typologically Grounded Evaluation Framework for Word Order and Morphology Sensitivity in Multilingual Masked LMs
Anna Feldman, Libby Barak, Jing Peng · Feb 28, 2026 · Citations: 0
LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation
Cunyuan Yang, Dejuan Song, Xiaotao Pang, Qianqian Shen, Wenjie Nie · Feb 28, 2026 · Citations: 0
Policy Compliance of User Requests in Natural Language for AI Systems
Pedro Cisneros-Velarde · Feb 27, 2026 · Citations: 0
Distribution-Aware Companding Quantization of Large Language Models
Athul Radhakrishnan, Siddhant Mohan, Mahima Sachdeva · Feb 27, 2026 · Citations: 0
How Large Language Models Get Stuck: Early structure with persistent errors
Alokesh Manna, William Snyder, Whitney Tabor · Feb 27, 2026 · Citations: 0

Pairwise Preference

We trained Meta's OPT model on the 100M word BabyLM dataset, and evaluated it on the BLiMP benchmark, which consists of 67 classes, each defined by sentence pairs that differ in a targeted syntactic or semantic rule violation.
Brittlebench: Quantifying LLM robustness via prompt sensitivity
Angelika Romanou, Mark Ibrahim, Candace Ross, Chantal Shaib, Kerem Oktar · Feb 27, 2026 · Citations: 0
When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang · Feb 27, 2026 · Citations: 0

Due to the resource-intensive nature of large-scale human validation, the model's performance was evaluated through a dual-track framework: Track A utilized traditional lexical similarity metrics (e.g., BLEU, ROUGE), while Track B employed…
From Prerequisites to Predictions: Validating a Geometric Hallucination Taxonomy Through Controlled Induction
Matic Korun · Feb 27, 2026 · Citations: 0
Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
Xintong Li, Sha Li, Rongmei Lin, Hongye Jin, Linwei Li · Feb 27, 2026 · Citations: 0

Long Horizon

Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy.
Transformers Remember First, Forget Last: Dual-Process Interference in LLMs
Sourav Chattaraj, Kanak Raj · Feb 27, 2026 · Citations: 0

Every model shows the same pattern: proactive interference (PI) dominates retroactive interference (RI) universally (Cohen's d = 1.73, p < 0.0001), meaning early encodings are protected at the cost of recent information -- the opposite of…
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao · Feb 27, 2026 · Citations: 0

Long Horizon

The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking.
Do LLMs Benefit From Their Own Words?
Jenny Y. Huang, Leshem Choshen, Ramon Astudillo, Tamara Broderick, Jacob Andreas · Feb 27, 2026 · Citations: 0
Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation
Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan · Feb 27, 2026 · Citations: 0
A Minimal Agent for Automated Theorem Proving
Borja Requena, Austin Letson, Krystian Nowakowski, Izan Beltran Ferreiro, Leopoldo Sarra · Feb 27, 2026 · Citations: 0
Controllable Reasoning Models Are Private Thinkers
Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych · Feb 27, 2026 · Citations: 0
Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
Gregory Kang Ruey Lau, Hieu Dao, Nicole Kan Hui Lin, Bryan Kian Hsiang Low · Feb 27, 2026 · Citations: 0
MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games
Jacob Eisenstein, Fantine Huot, Adam Fisch, Jonathan Berant, Mirella Lapata · Feb 27, 2026 · Citations: 0
Task-Centric Acceleration of Small-Language Models
Dor Tsur, Sharon Adar, Ran Levy · Feb 27, 2026 · Citations: 0
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Adam Dejl, Deniz Gorur, Francesca Toni · Feb 27, 2026 · Citations: 0

Demonstrations

Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by…
CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning
Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao · Feb 27, 2026 · Citations: 0

Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, action decision and action function.
AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation
Zhengren Wang, Dongsheng Ma, Huaping Zhong, Jiayu Li, Wentao Zhang · Feb 27, 2026 · Citations: 0
Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek
James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky · Feb 27, 2026 · Citations: 0

This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose.
Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification
Vikash Singh, Debargha Ganguly, Haotian Yu, Chengwei Zhou, Prerna Singh · Feb 27, 2026 · Citations: 0
Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance
Yanwei Ren, Haotian Zhang, Likang Xiao, Xikai Zhang, Jiaxing Huang · Feb 27, 2026 · Citations: 0

Long Horizon

To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained,…
ARGUS: Seeing the Influence of Narrative Features on Persuasion in Argumentative Texts
Sara Nabhani, Federico Pianzola, Khalid Al-Khatib, Malvina Nissim · Feb 27, 2026 · Citations: 0
Preference Packing: Efficient Preference Optimization for Large Language Models
Jaekyung Cho · Feb 27, 2026 · Citations: 0

Pairwise Preference

We propose preference packing, a method to enhance resource efficiency in training techniques that use data with different responses for the same input prompt, such as reward models or Direct Preference Optimization (DPO).
SongSong: A Time Phonograph for Chinese SongCi Music from Thousand of Years Away
Jiajia Li, Jiliang Hu, Ziyi Pan, Chong Chen, Zuchao Li · Feb 27, 2026 · Citations: 0
A Novel Hierarchical Multi-Agent System for Payments Using LLMs
Joon Kiat Chua, Donghao Huang, Zhaoxia Wang · Feb 27, 2026 · Citations: 0
Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis
Donghao Huang, Zhaoxia Wang · Feb 27, 2026 · Citations: 0
CIRCLE: A Framework for Evaluating AI from a Real-World Lens
Reva Schwartz, Carina Westling, Morgan Briggs, Marzieh Fadaee, Isar Nejadgholi · Feb 27, 2026 · Citations: 0
Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving
Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu · Feb 27, 2026 · Citations: 0
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie · Feb 27, 2026 · Citations: 0

Pairwise Preference

Reward models are central to aligning large language models (LLMs) with human preferences.
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu · Feb 27, 2026 · Citations: 0

Red Team Multi Agent

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.
Dialect and Gender Bias in YouTube's Spanish Captioning System
Iris Dania Jimenez, Christoph Kern · Feb 27, 2026 · Citations: 0
The GRADIEND Python Package: An End-to-End System for Gradient-Based Feature Learning
Jonathan Drechsel, Steffen Herbold · Feb 27, 2026 · Citations: 0

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now