HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-21

Updated from current HFEPX corpus (Apr 12, 2026). 23 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 23 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Adjudication. Frequently cited benchmark: AIME. Common metric signal: jailbreak success rate. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 21, 2026.

Papers: 23 Last published: Feb 21, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

23 / 23 papers are not low-signal flagged.

Benchmark Anchors

17.4%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

47.8%

Papers with reported metric mentions in extraction output.

2 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

13% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 34.8% of papers in this hub.
AIME is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is adjudication (4.3% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language
Feb 21, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Agreement
Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Feb 21, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Precision
ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Feb 21, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Feb 21, 2026 · Citations: 0 · Score: 5.5

Eval: Human Eval · Metrics: Not reported
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Feb 21, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Success rate, Jailbreak success rate
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation
Feb 21, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Error rate, Wer

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language Feb 21, 2026	Automatic Metrics	Not reported	Agreement	Inter Annotator Agreement Reported, Adjudication
Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning Feb 21, 2026	Automatic Metrics	Nyayabench	Accuracy, Precision	Not reported
ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models Feb 21, 2026	Automatic Metrics	Arabicnumbench	Accuracy	Not reported
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models Feb 21, 2026	Human Eval	GSM8K, AIME	Not reported	Not reported
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs Feb 21, 2026	Automatic Metrics	Not reported	Success rate, Jailbreak success rate	Not reported
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation Feb 21, 2026	Automatic Metrics	Not reported	Error rate, Wer	Not reported
MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs Feb 21, 2026	Not reported	Not reported	Precision	Calibration
BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models Feb 21, 2026	Automatic Metrics	Not reported	Toxicity	Not reported
ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models Feb 21, 2026	Automatic Metrics	Not reported	Jailbreak success rate	Not reported
AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting Feb 21, 2026	Not reported	Not reported	Precision	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (13% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (8.7% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (4.3% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (13% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (8.7% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (17.4% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 8.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.7% coverage).
Annotation unit is under-specified (17.4% coverage).

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (AIME vs Correctbench) before comparing methods.
Track metric sensitivity by reporting both jailbreak success rate and agreement.

Recommended Queries

Human Eval Protocols Benchmark Slice: AIME Metric Slice: jailbreak success rate IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 8.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (8)
Human Eval (1)

Top Metrics

Jailbreak success rate (2)
Agreement (1)
Error rate (1)
Success rate (1)

Top Benchmarks

AIME (1)
Correctbench (1)
Cruxeval (1)
GSM8K (1)

Quality Controls

Adjudication (1)
Calibration (1)
Inter Annotator Agreement Reported (1)

Papers In This Archive Slice

Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026 · Citations: 0

Multi Agent

We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026 · Citations: 0

Pairwise Preference

This protocol incorporates context-sensitive interpretation and community-informed guidelines and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages.
MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs
Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata · Feb 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026 · Citations: 0

Personal AI agents incur substantial cost via repeated LLM calls.
DeepInnovator: Triggering the Innovative Capabilities of LLMs
Tianyu Fan, Fengji Zhang, Yuxiang Zheng, Bei Chen, Xinyao Niu · Feb 21, 2026 · Citations: 0

The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously…
AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting
Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari · Feb 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning
Yujiao Yang · Feb 21, 2026 · Citations: 0

Extensive experiments across multiple reasoning benchmarks demonstrate that the proposed framework provides multi-level, verifiable explanations, including executable reasoning structures for individual instances, feasible-region…
[b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic
Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David Harwath, David R. Mortensen · Feb 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Hyperbolic Busemann Neural Networks
Ziheng Chen, Bernhard Schölkopf, Nicu Sebe · Feb 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation
Adam Dejl, Jonathan Pearson · Feb 21, 2026 · Citations: 0

Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains.
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0

Pairwise Preference

We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight…
BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models
Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat · Feb 21, 2026 · Citations: 0

We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG).
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026 · Citations: 0

Red Team

We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold.
ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan · Feb 21, 2026 · Citations: 0

We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9).
The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol
Andreas Schlapbach · Feb 21, 2026 · Citations: 0

This paper establishes a fundamental convergence: Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP) represent two manifestations of a unified paradigm for deterministic, auditable LLM-agent interaction.
Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem
Lichang Song, Ting Long, Yi Chang · Feb 21, 2026 · Citations: 0

Multi Agent

To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer…
ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models
Zefang Liu, Chenyang Zhu, Sangwoo Cho, Shi-Xiong Zhang · Feb 21, 2026 · Citations: 0

Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
Many AI Analysts, One Dataset: Navigating the Agentic Data Science Multiverse
Martin Bertran, Riccardo Fogliato, Zhiwei Steven Wu · Feb 21, 2026 · Citations: 0
Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026 · Citations: 0

Long Horizon

LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift
Stephen Russell · Feb 21, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM
Md Badsha Biswas, Ozlem Uzuner · Feb 21, 2026 · Citations: 0

Through extensive evaluation on four benchmark datasets with five LLMs, we show that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning.
From Trial by Fire To Sleep Like a Baby: A Lexicon of Anxiety Associations for 20k English Multiword Expressions
Saif M. Mohammad · Feb 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Spilled Energy in Large Language Models
Adrian Robert Minut, Hazem Dewidar, Iacopo Masi · Feb 21, 2026 · Citations: 0

Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task…

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now