HFEPX Archive Slice

HFEPX Daily Archive: 2026-03-27

Updated from current HFEPX corpus (Apr 9, 2026). 49 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 9, 2026). 49 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequent quality control: Calibration. Frequently cited benchmark: Codabench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 27, 2026.

Papers: 49 Last published: Mar 27, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

49 / 49 papers are not low-signal flagged.

Benchmark Anchors

22.4%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

46.9%

Papers with reported metric mentions in extraction output.

2 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

12.2% of papers report explicit human-feedback signals, led by expert verification.
automatic metrics appears in 36.7% of papers in this hub.
Codabench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (4.1% of papers).
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
Mar 27, 2026 · Citations: 0 · Score: 7.5

Eval: Automatic Metrics · Metrics: Accuracy
ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims
Mar 27, 2026 · Citations: 0 · Score: 7.5

Eval: Automatic Metrics · Metrics: Recall, Recall@k
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Mar 27, 2026 · Citations: 0 · Score: 7.5

Eval: Automatic Metrics · Metrics: Success rate
Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models
Mar 27, 2026 · Citations: 0 · Score: 7.0

Eval: Automatic Metrics · Metrics: Accuracy, Precision
FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?
Mar 27, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy, Latency
Analysing Calls to Order in German Parliamentary Debates
Mar 27, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Relevance

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Stabilizing Rubric Integration Training via Decoupled Advantage Normalization Mar 27, 2026	Automatic Metrics	Olympiadbench	Accuracy	Not reported
ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims Mar 27, 2026	Automatic Metrics	Codabench	Recall, Recall@k	Not reported
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation Mar 27, 2026	Automatic Metrics	Xpertbench	Success rate	Not reported
Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models Mar 27, 2026	Automatic Metrics	Not reported	Accuracy, Precision	Calibration
FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified? Mar 27, 2026	Automatic Metrics	Formalproofbench	Accuracy, Latency	Not reported
Analysing Calls to Order in German Parliamentary Debates Mar 27, 2026	Automatic Metrics	LMSYS Chatbot Arena	Relevance	Not reported
DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models Mar 27, 2026	Automatic Metrics	MMLU, DROP	Accuracy, Perplexity	Not reported
GS-BrainText: A Multi-Site Brain Imaging Report Dataset from Generation Scotland for Clinical Natural Language Processing Development and Validation Mar 27, 2026	Automatic Metrics	Not reported	F1	Calibration
ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory Mar 27, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models Mar 27, 2026	Not reported	MMLU, GPQA	Faithfulness	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (12.2% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (4.1% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (8.2% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (12.2% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (14.3% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (8.2% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 4.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (14.3% coverage).
Annotation unit is under-specified (8.2% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (Codabench vs GSM8K) before comparing methods.
Track metric sensitivity by reporting both accuracy and precision.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: Codabench Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 4.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (14.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (18)
Llm As Judge (1)

Top Metrics

Accuracy (4)
Precision (1)
Recall (1)
Recall@k (1)

Top Benchmarks

Codabench (1)
GSM8K (1)
Olympiadbench (1)
Xpertbench (1)

Quality Controls

Calibration (2)

Papers In This Archive Slice

Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching
Andrea Fraschini, Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli · Mar 27, 2026 · Citations: 0

Long Horizon

Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.
Introducing MELI: the Mandarin-English Language Interview Corpus
Suyuan Liu, Molly Babel · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
TAPS: Task Aware Proposal Distributions for Speculative Sampling
Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem · Mar 27, 2026 · Citations: 0

Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench.
Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language
Hanif Rahman, Shafeeq ur Rehman · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models
Rahul Soni · Mar 27, 2026 · Citations: 0

Critique Edit Long Horizon

Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks.
The Last Fingerprint: How Markdown Training Shapes LLM Prose
E. M. Freeburg · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PHONOS: PHOnetic Neutralization for Online Streaming Applications
Waris Quamer, Mu-Ruei Tseng, Ghady Nasrallah, Ricardo Gutierrez-Osuna · Mar 27, 2026 · Citations: 0

Our evaluations show an 81% reduction in non-native accent confidence, with listening-test ratings consistent with this shift, and reduced speaker linkability as accent-neutralized utterances move away from the original speaker in embedding…
FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?
Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu · Mar 27, 2026 · Citations: 0

We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level.
A large corpus of lucid and non-lucid dream reports
Remington Mallett · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Multilingual Stutter Event Detection for English, German, and Mandarin Speech
Felix Haas, Sebastian P. Bayerl · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
In your own words: computationally identifying interpretable themes in free-text survey data
Jenny S Wang, Aliya Saperstein, Emma Pierson · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation
Lorcan McLaren, James Cross, Zuzanna Krakowska, Robin Rauner, Martijn Schoonvelde · Mar 27, 2026 · Citations: 0

Most evaluations test a single model or configuration; how model choice, model size, learning approach, and prompt style interact, and whether popular "best practices" survive controlled comparison, are largely unexplored.
Learning to Commit: Generating Organic Pull Requests via Online Repository Memory
Mo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu · Mar 27, 2026 · Citations: 0

Large language model (LLM)-based coding agents achieve impressive results on controlled benchmarks yet routinely produce pull requests that real maintainers reject.
Weight Tying Biases Token Embeddings Towards the Output Space
Antonio Lopardo, Avyukth Harish, Catherine Arnett, Akshat Gupta · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian · Mar 27, 2026 · Citations: 0

Long Horizon

We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning.
EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching
Paul Bontempo · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference
Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models
Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo · Mar 27, 2026 · Citations: 0

We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions.
Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model
Maria Kefala, Jeffery L. Painter, Syed Tauhid Bukhari, Maurizio Sessa · Mar 27, 2026 · Citations: 0

Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance.
How Open Must Language Models be to Enable Reliable Scientific Inference?
James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng · Mar 27, 2026 · Citations: 0

Rubric Rating

We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward…
ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs
Inês Vieira, Inês Calvo, Iago Paulo, James Furtado, Rafael Ferreira · Mar 27, 2026 · Citations: 0

European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR).
JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems
Guangzhao Yang, Yu Pan, Shi Qiu, Ningjie Bai · Mar 27, 2026 · Citations: 0

Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments.
AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese
Afonso Simplício, Gonçalo Vinagre, Miguel Moura Ramos, Diogo Tavares, Rafael Ferreira · Mar 27, 2026 · Citations: 0

Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and…
Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs
Vinicius Anjos de Almeida, Sandro Saorin da Silva, Josimar Chire, Leonardo Vicenzi, Nícolas Henrique Borges · Mar 27, 2026 · Citations: 0

Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce.
Entanglement as Memory: Mechanistic Interpretability of Quantum Language Models
Nathan Roll · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims
Raia Abu Ahmad, Max Upravitelev, Aida Usmanova, Veronika Solopova, Georg Rehm · Mar 27, 2026 · Citations: 0

Pairwise Preference

In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems.
Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models
Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen · Mar 27, 2026 · Citations: 0

Expert Verification

Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients.
Analysing Calls to Order in German Parliamentary Debates
Nina Smirnova, Daniel Dan, Philipp Mayr · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models
Richard J. Young · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Word Alignment-Based Evaluation of Uniform Meaning Representations
Daniel Zeman, Federica Gamba · Mar 27, 2026 · Citations: 0

Comparison and evaluation of graph-based representations of sentence meaning is a challenge because competing representations of the same sentence may have different number of nodes, and it is not obvious which nodes should be compared to…
Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers
Yusheng Zhao, Hourun Li, Bohan Wu, Jingyang Yuan, Meng Zhang · Mar 27, 2026 · Citations: 0

Extensive experiments are conducted on twenty-three benchmark datasets across both regular (4K) and long (32K) context lengths, demonstrating the effectiveness of the proposed method.
A Formal Framework for Uncertainty Analysis of Text Generation with Large Language Models
Steffen Herbold, Florian Lemmerich · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law
JiHyeok Jung, TaeYoung Yoon, HyunSouk Cho · Mar 27, 2026 · Citations: 0

However, existing legal benchmarks primarily evaluate rule application under the assumption of fixed norms, and thus fail to capture situations where legal judgments shift or where multiple norms interact.
From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs
Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An · Mar 27, 2026 · Citations: 0

As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models' (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial…
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang · Mar 27, 2026 · Citations: 0

Rubric RatingExpert Verification

To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding
Héctor Javier Vázquez Martínez · Mar 27, 2026 · Citations: 0

Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets,…
Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models
Antoine Edy, Max Conti, Quentin Macé · Mar 27, 2026 · Citations: 0

We analyze these behaviors for state-of-the-art models on the NanoBEIR benchmark.
SocialX: A Modular Platform for Multi-Source Big Data Research in Indonesia
Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan
Chihiro Taguchi, Yukinori Takubo, David Chiang · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR
Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs
Uri Z. Kialy, Avi Shtarkberg, Ayal Klein · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
GS-BrainText: A Multi-Site Brain Imaging Report Dataset from Generation Scotland for Clinical Natural Language Processing Development and Validation
Beatrice Alex, Claire Grover, Arlene Casey, Richard Tobin, Heather Whalley · Mar 27, 2026 · Citations: 0

Benchmark evaluation using EdIE-R, an existing rule-based NLP system developed in conjunction with the annotation schema, revealed some performance variation across health boards (F1: 86.13-98.13), phenotypes (F1: 22.22-100) and age groups…
Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents
Nicholas Edwards, Sebastian Schuster · Mar 27, 2026 · Citations: 0

Multi Agent

We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution.
Sparse Auto-Encoders and Holism about Large Language Models
Jumbly Grindrod · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
An Object Web Seminar: A Retrospective on a Technical Dialogue Still Reverberating
James J. Cusick · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory
Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang · Mar 27, 2026 · Citations: 0

Expert Verification Multi Agent

To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians.
DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Clash of the models: Comparing performance of BERT-based variants for generic news frame detection
Vihang Jumle · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote