HFEPX Archive Slice
HFEPX Daily Papers for 2026-05-21
Daily archive slice for 2026-05-21 from the HFEPX corpus. Updated from current HFEPX corpus (2026-06-08); covers 60 papers from 2026-05-21.
Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.
High-Signal Coverage
100.0%
60 / 60 papers are not low-signal flagged.
Benchmark Anchors
13.3%
Papers with benchmark/dataset mentions in extraction output.
Metric Anchors
48.3%
Papers with reported metric mentions in extraction output.
Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.
Ranked by protocol completeness and evidence density for faster period-over-period review.
May 21, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Pass@k
May 21, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy
May 21, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: F1, F1 macro
May 21, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy, Recall
May 21, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Pass@1
May 21, 2026 · Citations: 0 · Score: 6.5
Eval: Simulation Env · Metrics: Recall
Quickly compare method ingredients across this archive slice.
Gap: Human feedback
Human feedback is present in 5 of 60 papers.
Gap: Quality controls
Quality controls are present in 2 of 60 papers.
Gap: Benchmarks
Benchmarks are present in 8 of 60 papers.
Moderate: Metrics
Metrics are present in 29 of 60 papers.
Gap: Known rater population
Known rater population is present in 3 of 60 papers.
Gap: Known annotation unit
Known annotation unit is present in 6 of 60 papers.
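The counts above map directly onto the percentages shown in the coverage cards (for example, 8 of 60 papers gives the 13.3% Benchmark Anchors figure). A minimal sketch of that arithmetic follows. The dictionary and the function name `coverage_percent` are illustrative assumptions, not part of HFEPX, and rounding to one decimal is inferred from the displayed values.

```python
# Counts copied from the "present in N of 60 papers" lines above.
# Names below are hypothetical, chosen only for this illustration.
TOTAL_PAPERS = 60
present_counts = {
    "Human feedback": 5,
    "Quality controls": 2,
    "Benchmarks": 8,
    "Metrics": 29,
    "Known rater population": 3,
    "Known annotation unit": 6,
}


def coverage_percent(count, total):
    """Share of papers with the ingredient, as a percentage rounded to one decimal."""
    return round(100.0 * count / total, 1)


coverage = {
    name: coverage_percent(count, TOTAL_PAPERS)
    for name, count in present_counts.items()
}

for name, pct in coverage.items():
    print(f"{name}: {pct}% ({present_counts[name]} / {TOTAL_PAPERS})")
```

Running the same calculation on the next slice gives directly comparable figures for period-over-period review. The cutoffs that separate "Gap" from "Moderate" are not stated on this page, so the sketch does not attempt to reproduce them.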
Evaluation Modes
Top Metrics
Top Benchmarks
Quality Controls
Brett Israelsen, Sheryl Carty, Josh Coates, Nancy Fulda, Julie Park · May 21, 2026 · Citations: 0
We tested 20 commercial and open-source language models across 182 religion pairings using a human-verified LLM-as-judge framework.
Long Phan, Devin Kim, Alexander Pan, Alice Blair, Adam Khoja · May 21, 2026 · Citations: 0
We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks.
Jing Chen, Gábor Parti, Yin Zhong, Chu-Ren Huang, Marco Marelli · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, Tiago Pimentel · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Mirac Suzgun, Emily Shen, Federico Bianchi, Alexander Spangher, Thomas Icard · May 21, 2026 · Citations: 0
We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5 and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional…
Pilchen Hippolyte, Fabre Romain, Signe Talla Franck, Perez Patrick, Grave Edouard · May 21, 2026 · Citations: 0
First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods.
Md Shamim Ahmed, Farzaneh Firoozbakht, Lukas Galke Poech, Jan Baumbach, Richard Röttger · May 21, 2026 · Citations: 0
The graph is constructed through a disease-autonomous multi-agent pipeline in which multiple frontier LLMs independently extract knowledge from PubMed and PMC literature.
Juergen Dietrich · May 21, 2026 · Citations: 0
We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline.
Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin · May 21, 2026 · Citations: 0
As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild.
Sid-ali Temkit · May 21, 2026 · Citations: 0
Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative…
Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Guangya Hao, Yitong Shang, Yunbo Long, Zhuokai Zhao, Hanxue Liang · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Maciej Skorski · May 21, 2026 · Citations: 0
Using ~50k morally annotated social media posts from a diverse range of topics, we apply a principled four-method validation pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation,…
Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Alina Karakanta, Alex Christiansen, Tomás Dodds, Bissie Anderson, Matteo Fuoli · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai · May 21, 2026 · Citations: 0
Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions.
Víctor Yeste, Paolo Rosso · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali · May 21, 2026 · Citations: 0
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions.
Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer · May 21, 2026 · Citations: 0
To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework.
Jiayi Fu, Yuxia Wang · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Yuchun Fan, Bei Li, Peiguang Li, Yilin Wang, Yongyu Mu · May 21, 2026 · Citations: 0
Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency.
Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti · May 21, 2026 · Citations: 0
We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories.
Yevhen Kostiuk, Kenneth Enevoldsen · May 21, 2026 · Citations: 0
Single-point evaluation ignores a main problem of the instruction-based approach, namely its sensitivity to the phrasing of the instruction.
Yejin Cho, Katrin Erk · May 21, 2026 · Citations: 0
Our contributions are three-fold: (1) a structured representation framework for situated lexical meaning; (2) COCA-Scenes, a dataset of 520 usage instances across 26 keywords for distinct scene identification; and (3) empirical evidence…
Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao · May 21, 2026 · Citations: 0
Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world…
Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Xuxin Zhang · May 21, 2026 · Citations: 0
Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent.
Morita Tarvirdians, Senthil Chandrasegaran, Hayley Hung, Catholijn M. Jonker, Catharine Oertel · May 21, 2026 · Citations: 0
In this study, we investigate an agent designed to encourage integration by adapting to the individual user's thought patterns.
Darya Shlyk, Stefano Montanelli, Lawrence Hunter · May 21, 2026 · Citations: 0
Our method demonstrates strong performance on multiple BEL benchmarks, yielding significant improvements in linking accuracy (3%-24%) while reducing inference time compared to the state-of-the-art.
Md. Asaduzzaman Shuvo, Mahedi Hasan, Md. Tashin Parvez, Azizul Haque Noman, Md. Shafayet Hossain Ovi · May 21, 2026 · Citations: 0
To address this limitation, we introduce BLADE (BangLa Application and DialoguE generation), a novel, culturally aligned instruction-tuning dataset and benchmarking framework comprising 4,196 meticulously curated interaction pairs.
Hangyue Zhao, Paul Caillon, Erwan Fagnou, Alexandre Allauzen · May 21, 2026 · Citations: 0
Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation remains expensive.
Stefan Bleeck · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Caleb Munigety · May 21, 2026 · Citations: 0
Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56).
Aisha Ali Al-Athba, Wajdi Zaghouani · May 21, 2026 · Citations: 0
The annotation process combines expert human judgment with model-assisted pre-labeling verified by trained annotators, achieving substantial inter-annotator agreement (Cohen's kappa = 0.85).
Genoveffa Martone, Helena Bonaldi, Marco Guerini · May 21, 2026 · Citations: 0
23 experts revise the generated CS, which are assessed via human and automatic metrics.
Jianing Yin, Tan Tang · May 21, 2026 · Citations: 0
Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content.
Jakub Radzikowski, Josef Chen · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Xiaoyuan Li, Yubo Ma, Chengpeng Li, Fengbin Zhu, Yiyao Yu · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Pranshu Rastogi, Madhav Mathur, Ramaneswaran S, Kshitij Mohan · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Anthony Hughes, Alexander Goldberg, Prince Jha, Adam Perer, Nikolaos Aletras · May 21, 2026 · Citations: 0
Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models.
Nicola Milano, Davide Marocco · May 21, 2026 · Citations: 0
Large language models are increasingly used as computational tools for modeling human-like behavior.
Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu · May 21, 2026 · Citations: 0
We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre-training corpus and benchmark data for three…
Alexis Amid Neme, Eric Laporte · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Jingyi Kang, Junyu Lu, Bo Xu, Hongbo Wang, Linlin Zong · May 21, 2026 · Citations: 0
We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework, not a deployable evasion tool.
Kai Golan Hashiloni, Daniel Fadlon, Lior Livyatan, Ofri Hefetz, Jiahuan Pei · May 21, 2026 · Citations: 0
We introduce IdioLink, a retrieval benchmark designed to test whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms.
Yu Du, Wenlong Zhu, Xingze Li, Chenglong Cao, Jing Wang · May 21, 2026 · Citations: 0
Extensive experiments on six standard ABSA benchmarks show that GHI outperforms all baselines on the SemEval domains, and multi-seed evaluations show stable improvements over strong DeBERTa.
Sophia Xiao Pu, Zhaotian Weng, Chengzhi Liu, Jayanth Srinivasa, Gaowen Liu · May 21, 2026 · Citations: 0
Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels.
Wajdi Zaghouani, Mabrouka Bessghaier, MD. Rafiul Biswas, Shimaa Amer Ibrahim · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Sovandara Chhoun, Pichdara Po, Sereiwathna Ros, Wan-Sup Cho, Saksonita Khoeurn · May 21, 2026 · Citations: 0
For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs.
Amanda Myntti, Jenna Kanerva, Veronika Laippala, Filip Ginter · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu · May 21, 2026 · Citations: 0
In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential…
Luca Modica, Filip Landin, Mehrdad Farahani, Livia Qian, Gabriel Skantze · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu · May 21, 2026 · Citations: 0
We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills.
Chaogui Gou, Jiarui Liang · May 21, 2026 · Citations: 0
Through interactive simulation between a student agent and a counselor agent, together with a structured memory integration mechanism, Psy-Chronicle generates long-horizon dialogues with continuity across counseling sessions.
Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian · May 21, 2026 · Citations: 0
To test this, we develop SR^2AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model.
Andrew Ivan Soegeng, Patrick Sutanto, Tan Sang Nguyen · May 21, 2026 · Citations: 0
Evaluations on the BLEnD benchmark demonstrate that our approach significantly improves cultural alignment, boosting performance on English queries by an average of 5.03% while relying entirely on self-generated data.
Sereiwathna Ros, Phannet Pov, Ratanaktepi Chhor, Kimleang Ly, Wan-Sup Cho · May 21, 2026 · Citations: 0
We conduct a two-phase comparative evaluation.
Wajdi Zaghouani, Shimaa Amer Ibrahim, Mabrouka Bessghaier, Houda Bouamor · May 21, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.