HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-20

Updated from current HFEPX corpus (Apr 12, 2026). 38 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 38 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Gold Questions. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 20, 2026.

Papers: 38 Last published: Feb 20, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

38 / 38 papers are not low-signal flagged.

Benchmark Anchors

15.8%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

57.9%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

13.2% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 52.6% of papers in this hub.
long-horizon tasks appears in 2.6% of papers, indicating agentic evaluation demand.

Protocol Takeaways For This Period

Most common quality-control signal is gold-question checks (2.6% of papers).
Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Validating Political Position Predictions of Arguments
Feb 20, 2026 · Citations: 0 · Score: 6.5

Eval: Human Eval · Metrics: Agreement
SPQ: An Ensemble Technique for Large Language Model Compression
Feb 20, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Perplexity
Agentic Adversarial QA for Improving Domain-Specific LLMs
Feb 20, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Recall
Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Feb 20, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Recall, Coherence
FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Feb 20, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Feb 20, 2026 · Citations: 0 · Score: 5.5

Eval: Llm As Judge, Automatic Metrics · Metrics: Accuracy, Win rate

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Validating Political Position Predictions of Arguments Feb 20, 2026	Human Eval	Not reported	Agreement	Gold Questions, Inter Annotator Agreement Reported
SPQ: An Ensemble Technique for Large Language Model Compression Feb 20, 2026	Automatic Metrics	GSM8K, TruthfulQA	Accuracy, Perplexity	Not reported
Agentic Adversarial QA for Improving Domain-Specific LLMs Feb 20, 2026	Automatic Metrics	Legalbench	Accuracy, Recall	Not reported
Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering Feb 20, 2026	Automatic Metrics	Financebench	Recall, Coherence	Not reported
FENCE: A Financial and Multimodal Jailbreak Detection Dataset Feb 20, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards Feb 20, 2026	Llm As Judge, Automatic Metrics	Not reported	Accuracy, Win rate	Not reported
CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications Feb 20, 2026	Automatic Metrics	Not reported	Precision, Recall	Not reported
PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation Feb 20, 2026	Automatic Metrics	Not reported	Ndcg	Not reported
Luna-2: Scalable Single-Token Evaluation with Small Language Models Feb 20, 2026	Llm As Judge, Automatic Metrics	Not reported	Accuracy, Latency	Not reported
VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning Feb 20, 2026	Automatic Metrics	Not reported	Recall	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (13.2% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (2.6% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (0% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (18.4% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (13.2% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (10.5% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 2.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (13.2% coverage).
Annotation unit is under-specified (10.5% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Track metric sensitivity by reporting both accuracy and agreement.

Recommended Queries

Judge vs Human Agreement Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 2.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (13.2% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (20)
Human Eval (2)
Llm As Judge (2)
Simulation Env (2)

Top Metrics

Accuracy (4)
Agreement (2)
F1 (2)
Precision (2)

Top Benchmarks

Quality Controls

Gold Questions (1)
Inter Annotator Agreement Reported (1)

Papers In This Archive Slice

PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
Nina Hosseini-Kivanani · Feb 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning
Fangyuan Xu, Sihao Chen, Zinan Lin, Taiwei Shi, Sydney Graham · Feb 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer · Feb 20, 2026 · Citations: 0

Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever.
Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026 · Citations: 0

We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g.
Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026 · Citations: 0

Pairwise Preference Long Horizon

When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning
Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha · Feb 20, 2026 · Citations: 0

Existing Cultural benchmarks are (i) Manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured.
RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026 · Citations: 0

Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026 · Citations: 0

Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
Subgroups of $U(d)$ Induce Natural RNN and Transformer Architectures
Joshua Nunley · Feb 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Validating Political Position Predictions of Arguments
Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026 · Citations: 0

Pairwise Preference

Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0

Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
On the "Induction Bias" in Sequence Models
M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic · Feb 20, 2026 · Citations: 0

Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning
Tao Wu, Adam Kapelner · Feb 20, 2026 · Citations: 0

In summary, we demonstrate that a modern embedding model on neural network architecture, when guided by human supervision, results in a low-cost large supply of near-perfect contexts for teaching vocabulary for a variety of target words.
PsihoRo: Depression and Anxiety Romanian Text Corpus
Alexandra Ciobotaru, Ana-Maria Bucur, Liviu P. Dinu · Feb 20, 2026 · Citations: 0

Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health.
VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig · Feb 20, 2026 · Citations: 0

However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries.
On the Semantic and Syntactic Information Encoded in Proto-Tokens for One-Step Text Reconstruction
Ivan Bondarenko, Egor Palkin, Fedor Tikunov · Feb 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · Feb 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026 · Citations: 0

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
Simplifying Outcomes of Language Model Component Analyses with ELIA
Aaron Louis Eidt, Nils Feldhus · Feb 20, 2026 · Citations: 0

Pairwise Preference

The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations.
Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning
Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin · Feb 20, 2026 · Citations: 0

Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead.
Information-Theoretic Storage Cost in Sentence Comprehension
Kohei Kajikawa, Shinnosuke Isono, Ethan Gotlieb Wilcox · Feb 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Improving Sampling for Masked Diffusion Models via Information Gain
Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb · Feb 20, 2026 · Citations: 0

Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers for MDMs.
Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models
Wojciech Michaluk, Tymoteusz Urban, Mateusz Kubita, Soveatin Kuntur, Anna Wroblewska · Feb 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Mirae Kim, Seonghun Jeong, Youngjun Kwak · Feb 20, 2026 · Citations: 0

Red Team

A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models.
The Statistical Signature of LLMs
Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli · Feb 20, 2026 · Citations: 0

We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs.
Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention
Siya Qi, Yudong Chen, Runcong Zhao, Qinglin Zhu, Zhanghao Hu · Feb 20, 2026 · Citations: 0

Experiments on the RAGTruth and HalluRAG benchmarks show that our approach achieves performance gains over verification-based, internal-representation-based, and attention-based methods across models and tasks.
Agentic Adversarial QA for Improving Domain-Specific LLMs
Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki · Feb 20, 2026 · Citations: 0

Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
Perceived Political Bias in LLMs Reduces Persuasive Abilities
Matthew DiGiuseppe, Joshua Robison · Feb 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026 · Citations: 0

Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
CogGen: Cognitive-Load-Informed Fully Unsupervised Deep Generative Modeling for Compressively Sampled MRI Reconstruction
Qingyong Zhu, Yumin Tan, Xiang Gu, Dong Liang · Feb 20, 2026 · Citations: 0
Towards More Standardized AI Evaluation: From Models to Agents
Ali El Filali, Inès Bedar · Feb 20, 2026 · Citations: 0

Evaluation is no longer a final checkpoint in the machine learning lifecycle.
NIMMGen: Learning Neural-Integrated Mechanistic Digital Twins with LLMs
Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang · Feb 20, 2026 · Citations: 0

Recent work has explored LLM-based agentic frameworks to automatically construct mechanistic models from data; however, existing problem settings substantially oversimplify real-world conditions, leaving it unclear whether LLM-generated…
Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Amine Kobeissi, Philippe Langlais · Feb 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026 · Citations: 0

Expert Verification

The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
Analyzing LLM Instruction Optimization for Tabular Fact Verification
Xiaotang Du, Giwon Hong, Wai-Chung Kwan, Rohit Saxena, Ivan Titov · Feb 20, 2026 · Citations: 0

We study three optimizers from the DSPy framework -- COPRO, MiPROv2, and SIMBA -- across four benchmarks and three model families.
Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering
Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman · Feb 20, 2026 · Citations: 0

Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to given context.
Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions
Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini · Feb 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Games That Teach, Chats That Convince: Comparing Interactive and Static Formats for Persuasive Learning
Seyed Hossein Alavi, Zining Wang, Shruthi Chockkalingam, Raymond T. Ng, Vered Shwartz · Feb 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote