HFEPX Archive Slice

HFEPX Daily Archive: 2026-03-07

Updated from current HFEPX corpus (Mar 10, 2026). 23 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 10, 2026). 23 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Scalar. Frequent quality control: Calibration. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 7, 2026.

Papers: 23 Last published: Mar 7, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

23 / 23 papers are not low-signal flagged.

Benchmark Anchors

4.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

39.1%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

21.7% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 39.1% of papers in this hub.
long-horizon tasks appears in 4.3% of papers, indicating agentic evaluation demand.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (4.3% of papers).
Rater context is mostly domain experts, and annotation is commonly scalar scoring; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin
Mar 7, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: F1
To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise
Mar 7, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: F1, F1 macro
Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
Mar 7, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Helpfulness
SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions
Mar 7, 2026 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Cost
Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing
Mar 7, 2026 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Accuracy
Entropy-Aware On-Policy Distillation of Language Models
Mar 7, 2026 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Accuracy, Precision

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin Mar 7, 2026	Automatic Metrics	Ts Bench	F1	Not reported
To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise Mar 7, 2026	Automatic Metrics	Not reported	F1, F1 macro	Calibration
Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models Mar 7, 2026	Automatic Metrics	Not reported	Helpfulness	Not reported
SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions Mar 7, 2026	Automatic Metrics	Not reported	Cost	Not reported
Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing Mar 7, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Entropy-Aware On-Policy Distillation of Language Models Mar 7, 2026	Automatic Metrics	Not reported	Accuracy, Precision	Not reported
CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs Mar 7, 2026	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision Mar 7, 2026	Automatic Metrics	Not reported	Jailbreak success rate	Not reported
A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity Mar 7, 2026	Automatic Metrics	Not reported	Accuracy, Precision	Not reported
AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge Mar 7, 2026	Llm As Judge	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (21.7% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (4.3% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (0% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (8.7% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (17.4% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (17.4% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 4.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (17.4% coverage).
Annotation unit is under-specified (17.4% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Track metric sensitivity by reporting both cost and helpfulness.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols Metric Slice: cost Recent High-Signal Papers

Known Limitations

Only 4.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (17.4% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (9)
Llm As Judge (1)

Top Metrics

Cost (1)
Helpfulness (1)

Top Benchmarks

Quality Controls

Calibration (1)

Papers In This Archive Slice

SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions
Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali · Mar 7, 2026 · Citations: 0

Long Horizon

Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies.
Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios
Namrata Patil Gurav, Akashdeep Ranu, Archchana Sindhujan, Diptesh Kanojia · Mar 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Position: LLMs Must Use Functor-Based and RAG-Driven Bias Mitigation for Fairness
Ravi Ranjan, Utkarsh Grover, Agorista Polyzou · Mar 7, 2026 · Citations: 0

Critique Edit

Biases in large language models (LLMs) often manifest as systematic distortions in associations between demographic attributes and professional or social roles, reinforcing harmful stereotypes across gender, ethnicity, and geography.
RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts
Darya Kharlamova, Irina Proskurina · Mar 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
Nouran Khallaf, Serge Sharoff · Mar 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise
Nouran Khallaf, Serge Sharoff · Mar 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
The Third Ambition: Artificial Intelligence and the Science of Human Behavior
W. Russell Neuman, Chad Coleman · Mar 7, 2026 · Citations: 0

Contemporary artificial intelligence research has been organized around two dominant ambitions: productivity, which treats AI systems as tools for accelerating work and economic output, and alignment, which focuses on ensuring that…
Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin
Po-Chun Hsu, Meng-Hsi Chen, Tsu Ling Chao, Chia Tien Han, Da-shan Shiu · Mar 7, 2026 · Citations: 0

To address these gaps, we introduce TS-Bench (Taiwan Safety Benchmark), a standardized evaluation suite for assessing safety performance in Taiwanese Mandarin.
Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster
Minu Kim, Hoirin Kim, David R. Mortensen · Mar 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing
Arash Marioriyad, Ali Nouri, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah · Mar 7, 2026 · Citations: 0

As Large Language Models (LLMs) transition into autonomous agentic roles, the risk of deception-defined behaviorally as the systematic provision of false information to satisfy external incentives-poses a significant challenge to AI safety.
Fine-Grained Table Retrieval Through the Lens of Complex Queries
Wojciech Kosiuk, Xingyu Ji, Yeounoh Chung, Fatma Özcan, Madelon Hulsebos · Mar 7, 2026 · Citations: 0

Our analyses over industry-aligned benchmarks illustrate the robustness of DCTR for highly composite queries and densely connected databases.
Emotion Transcription in Conversation: A Benchmark for Capturing Subtle and Complex Emotional States through Natural Language
Yoshiki Tanaka, Ryuichi Uehara, Koji Inoue, Michimasa Inaba · Mar 7, 2026 · Citations: 0

Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interactions.
Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information
Yoshiki Tanaka, Takumasa Kaneko, Hiroki Onozeki, Natsumi Ezure, Ryuichi Uehara · Mar 7, 2026 · Citations: 0

In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG.
Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang · Mar 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Entropy-Aware On-Policy Distillation of Language Models
Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou · Mar 7, 2026 · Citations: 0

Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods.
CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs
Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li · Mar 7, 2026 · Citations: 0

Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy.
Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision
Shreyas Gopal, Donghang Wu, Ashutosh Anshul, Yeo Yue Heng, Yizhou Peng · Mar 7, 2026 · Citations: 0

We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with high-quality TTS questions.
Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment
Junming Liu, Yuqi Li, Shiping Wen, Zhigang Zeng, Tingwen Huang · Mar 7, 2026 · Citations: 0

Pairwise Preference

In this paper, we propose Hit-RAG, a multi-stage preference alignment framework designed to resolve these cognitive bottlenecks through a progressive optimization pipeline.
AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
Karen Zhou, Chenhao Tan · Mar 7, 2026 · Citations: 0

Pairwise Preference

Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge.
Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda · Mar 7, 2026 · Citations: 0

Pairwise PreferenceRed Team

Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale,…
A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity
Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn · Mar 7, 2026 · Citations: 0

We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems.
Elenchus: Generating Knowledge Bases from Prover-Skeptic Dialogues
Bradley P. Allen · Mar 7, 2026 · Citations: 0

Expert Verification

A human expert develops a bilateral position (commitments and denials) about a topic through prover-skeptic dialogue with a large language model (LLM) opponent.
Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards
Xin Zhang, Xingyu Li, Rongguang Wang, Ruizhong Miao, Zheng Wang · Mar 7, 2026 · Citations: 0

Our experiments demonstrate that Chart-RL consistently outperforms supervised fine-tuning (SFT) across different chart understanding benchmarks, achieving relative improvements of 16.7% on MutlChartQA, and 11.5% on ChartInsights.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote