HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-27

Updated from current HFEPX corpus (Mar 8, 2026). 51 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 8, 2026). 51 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: AdvBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 27, 2026.

Papers: 51 Last published: Feb 27, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

51 / 51 papers are not low-signal flagged.

Benchmark Anchors

3.9%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

13.7%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Time Slice Matters

11.8% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 11.8% of papers in this hub.
AdvBench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (2% of papers).
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Feb 27, 2026 · Citations: 0 · Score: 7.5

Eval: Llm As Judge · Metrics: Success rate, Jailbreak success rate
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
Feb 27, 2026 · Citations: 0 · Score: 7.0

Eval: Automatic Metrics · Metrics: Accuracy
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Feb 27, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy
When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
Feb 27, 2026 · Citations: 0 · Score: 5.0

Eval: Llm As Judge, Automatic Metrics · Metrics: Accuracy, Precision
Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
Feb 27, 2026 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Accuracy, Cost
Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek
Feb 27, 2026 · Citations: 0 · Score: 5.0

Eval: Human Eval, Automatic Metrics · Metrics: Bleu, Rouge

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking Feb 27, 2026	Llm As Judge	AdvBench, Jbf Eval	Success rate, Jailbreak success rate	Not reported
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models Feb 27, 2026	Automatic Metrics	Not reported	Accuracy	Calibration
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science Feb 27, 2026	Automatic Metrics	Dare Bench	Accuracy	Not reported
When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation Feb 27, 2026	Llm As Judge, Automatic Metrics	Not reported	Accuracy, Precision	Not reported
Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning Feb 27, 2026	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek Feb 27, 2026	Human Eval, Automatic Metrics	Not reported	Bleu, Rouge	Not reported
Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance Feb 27, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models Feb 27, 2026	Not reported	Not reported	Not reported	Not reported
Preference Packing: Efficient Preference Optimization for Large Language Models Feb 27, 2026	Not reported	Not reported	Not reported	Not reported
From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning Feb 27, 2026	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (11.8% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (2% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (13.7% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (49% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (3.9% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (9.8% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 2% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (3.9% coverage).
Annotation unit is under-specified (9.8% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (AdvBench vs consolidation-retrieval) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: AdvBench Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 2% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (3.9% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (6)
Llm As Judge (2)
Human Eval (1)

Top Metrics

Accuracy (11)
Cost (6)
Precision (6)
F1 (4)

Top Benchmarks

AdvBench (1)
Consolidation Retrieval (1)
Dare Bench (1)
DROP (1)

Quality Controls

Calibration (1)

Papers In This Archive Slice

Policy Compliance of User Requests in Natural Language for AI Systems
Pedro Cisneros-Velarde · Feb 27, 2026 · Citations: 0
Distribution-Aware Companding Quantization of Large Language Models
Athul Radhakrishnan, Siddhant Mohan, Mahima Sachdeva · Feb 27, 2026 · Citations: 0
How Large Language Models Get Stuck: Early structure with persistent errors
Alokesh Manna, William Snyder, Whitney Tabor · Feb 27, 2026 · Citations: 0
When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang · Feb 27, 2026 · Citations: 0

This paper details the baseline model selection, fine-tuning process, evaluation methods, and the implications of deploying more accurate LLMs in healthcare settings.
From Prerequisites to Predictions: Validating a Geometric Hallucination Taxonomy Through Controlled Induction
Matic Korun · Feb 27, 2026 · Citations: 0
Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
Xintong Li, Sha Li, Rongmei Lin, Hongye Jin, Linwei Li · Feb 27, 2026 · Citations: 0

Long Horizon

Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy.
Transformers Remember First, Forget Last: Dual-Process Interference in LLMs
Sourav Chattaraj, Kanak Raj · Feb 27, 2026 · Citations: 0
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao · Feb 27, 2026 · Citations: 0

Long Horizon

The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking.
Do LLMs Benefit From Their Own Words?
Jenny Y. Huang, Leshem Choshen, Ramon Astudillo, Tamara Broderick, Jacob Andreas · Feb 27, 2026 · Citations: 0
Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation
Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan · Feb 27, 2026 · Citations: 0
Controllable Reasoning Models Are Private Thinkers
Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych · Feb 27, 2026 · Citations: 0
Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
Gregory Kang Ruey Lau, Hieu Dao, Nicole Kan Hui Lin, Bryan Kian Hsiang Low · Feb 27, 2026 · Citations: 0
MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games
Jacob Eisenstein, Fantine Huot, Adam Fisch, Jonathan Berant, Mirella Lapata · Feb 27, 2026 · Citations: 0
Task-Centric Acceleration of Small-Language Models
Dor Tsur, Sharon Adar, Ran Levy · Feb 27, 2026 · Citations: 0
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Adam Dejl, Deniz Gorur, Francesca Toni · Feb 27, 2026 · Citations: 0

Demonstrations

Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by…
CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning
Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao · Feb 27, 2026 · Citations: 0
AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation
Zhengren Wang, Dongsheng Ma, Huaping Zhong, Jiayu Li, Wentao Zhang · Feb 27, 2026 · Citations: 0
Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek
James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky · Feb 27, 2026 · Citations: 0

This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose.
Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification
Vikash Singh, Debargha Ganguly, Haotian Yu, Chengwei Zhou, Prerna Singh · Feb 27, 2026 · Citations: 0
Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance
Yanwei Ren, Haotian Zhang, Likang Xiao, Xikai Zhang, Jiaxing Huang · Feb 27, 2026 · Citations: 0

Long Horizon

To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained,…
ARGUS: Seeing the Influence of Narrative Features on Persuasion in Argumentative Texts
Sara Nabhani, Federico Pianzola, Khalid Al-Khatib, Malvina Nissim · Feb 27, 2026 · Citations: 0
Preference Packing: Efficient Preference Optimization for Large Language Models
Jaekyung Cho · Feb 27, 2026 · Citations: 0

Pairwise Preference

We propose preference packing, a method to enhance resource efficiency in training techniques that use data with different responses for the same input prompt, such as reward models or Direct Preference Optimization (DPO).
SongSong: A Time Phonograph for Chinese SongCi Music from Thousand of Years Away
Jiajia Li, Jiliang Hu, Ziyi Pan, Chong Chen, Zuchao Li · Feb 27, 2026 · Citations: 0
A Novel Hierarchical Multi-Agent System for Payments Using LLMs
Joon Kiat Chua, Donghao Huang, Zhaoxia Wang · Feb 27, 2026 · Citations: 0
Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis
Donghao Huang, Zhaoxia Wang · Feb 27, 2026 · Citations: 0
Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving
Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu · Feb 27, 2026 · Citations: 0
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie · Feb 27, 2026 · Citations: 0

Pairwise Preference

Reward models are central to aligning large language models (LLMs) with human preferences.
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu · Feb 27, 2026 · Citations: 0

Red Team Multi Agent

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.
Dialect and Gender Bias in YouTube's Spanish Captioning System
Iris Dania Jimenez, Christoph Kern · Feb 27, 2026 · Citations: 0
The GRADIEND Python Package: An End-to-End System for Gradient-Based Feature Learning
Jonathan Drechsel, Steffen Herbold · Feb 27, 2026 · Citations: 0
MemEmo: Evaluating Emotion in Memory Systems of Agents
Peng Liu, Zhen Tao, Jihao Zhao, Ding Chen, Yansong Zhang · Feb 27, 2026 · Citations: 0
EDDA-Coordinata: An Annotated Dataset of Historical Geographic Coordinates
Ludovic Moncla, Pierre Nugues, Thierry Joliveau, Katherine McDonough · Feb 27, 2026 · Citations: 0
Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language
Nischal Karki, Bipesh Subedi, Prakash Poudyal, Rupak Raj Ghimire, Bal Krishna Bal · Feb 27, 2026 · Citations: 0
The Astonishing Ability of Large Language Models to Parse Jabberwockified Language
Gary Lupyan, Senyi Yang · Feb 27, 2026 · Citations: 0
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang · Feb 27, 2026 · Citations: 0
LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding
Alexander Samarin, Sergei Krutikov, Anton Shevtsov, Sergei Skvortsov, Filipp Fisin · Feb 27, 2026 · Citations: 0
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev · Feb 27, 2026 · Citations: 0
NAU-QMUL: Utilizing BERT and CLIP for Multi-modal AI-Generated Image Detection
Xiaoyu Guo, Arkaitz Zubiaga · Feb 27, 2026 · Citations: 0
CLFEC: A New Task for Unified Linguistic and Factual Error Correction in paragraph-level Chinese Professional Writing
Jian Kai, Zidong Zhang, Jiwen Chen, Zhengxiang Wu, Songtao Sun · Feb 27, 2026 · Citations: 0
GLUScope: A Tool for Analyzing GLU Neurons in Transformer Language Models
Sebastian Gerstner, Hinrich Schütze · Feb 27, 2026 · Citations: 0
Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding
Xiangzhong Luo, Yilin An, Zhicheng Yu, Weichen Liu, Xu Yang · Feb 27, 2026 · Citations: 0
Structured Prompt Optimization for Few-Shot Text Classification via Semantic Alignment in Latent Space
Jiasen Zheng, Zijun Zhou, Huajun Zhang, Junjiang Lin, Jingyun Jia · Feb 27, 2026 · Citations: 0
UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking
Hao Wu, Xudong Wang, Jialiang Zhang, Junlong Tong, Xinghao Chen · Feb 27, 2026 · Citations: 0
From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning
Seungdong Yoa, Sanghyu Yoon, Suhee Yoon, Dongmin Kim, Ye Seul Sim · Feb 27, 2026 · Citations: 0

Pairwise Preference

To overcome these limitations, we propose an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in which autonomous agents iteratively generate, validate, and solve problems.
Your Inference Request Will Become a Black Box: Confidential Inference for Cloud-based Large Language Models
Chung-ju Huang, Huiqiang Zhao, Yuanpeng He, Lijian Li, Wenpin Jiao · Feb 27, 2026 · Citations: 0
HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit
Hao Wu, Yingqi Fan, Jinyang Dai, Junlong Tong, Yunpu Ma · Feb 27, 2026 · Citations: 0
TRIZ-RAGNER: A Retrieval-Augmented Large Language Model for TRIZ-Aware Named Entity Recognition in Patent-Based Contradiction Mining
Zitong Xu, Yuqing Wu, Yue Zhao · Feb 27, 2026 · Citations: 0
LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning
Yu Zhu, Kai Yang · Feb 27, 2026 · Citations: 0
LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering
Rafid Ishrak Jahan, Fahmid Shahriar Iqbal, Sagnik Ray Choudhury · Feb 27, 2026 · Citations: 0

Pairwise PreferenceRubric Rating

We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA.
BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation
Yun Wang, Xuansheng Wu, Jingyuan Huang, Lei Liu, Xiaoming Zhai · Feb 27, 2026 · Citations: 0
Multi-Agent Causal Reasoning for Suicide Ideation Detection Through Online Conversations
Jun Li, Xiangmeng Wang, Haoyang Li, Yifei Yan, Shijie Zhang · Feb 27, 2026 · Citations: 0

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote