HFEPX Archive Slice
HFEPX Daily Papers for 2026-05-28
Daily archive slice for 2026-05-28 from the HFEPX corpus. Updated from current HFEPX corpus (2026-06-01); covers 60 papers from 2026-05-28.
Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.
High-Signal Coverage
100.0%
60 of 60 papers are not flagged as low-signal.
Benchmark Anchors
13.3%
Papers with benchmark/dataset mentions in extraction output.
Metric Anchors
30.0%
Papers with reported metric mentions in extraction output.
Primary action: use this slice for trend comparison. Review the top papers first, then validate shifts in the protocol matrix.
Ranked by protocol completeness and evidence density for faster period-over-period review.
May 28, 2026 · Citations: 0 · Score: 7.5
Eval: Automatic Metrics · Metrics: Success rate, Jailbreak success rate
May 28, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy
May 28, 2026 · Citations: 0 · Score: 6.5
Eval: Simulation Env · Metrics: Task success
May 28, 2026 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Accuracy
May 28, 2026 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Accuracy, Recall
May 28, 2026 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Accuracy, Spearman
Quickly compare method ingredients across this archive slice.
Moderate: Human feedback
Human feedback is present in 17 of 60 papers.
Gap: Quality controls
Quality controls are present in 3 of 60 papers.
Gap: Benchmarks
Benchmarks are present in 8 of 60 papers.
Moderate: Metrics
Metrics are present in 18 of 60 papers.
Gap: Known rater population
Known rater population is present in 4 of 60 papers.
Gap: Known annotation unit
Known annotation unit is present in 10 of 60 papers.
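The counts above can be turned into coverage shares for period-over-period comparison. The following is a minimal sketch, not the page's own scoring code: the counts are copied from the matrix above, while the names `counts`, `coverage_share`, and `shares` are hypothetical and introduced only for illustration. It does not reproduce the "Moderate" and "Gap" labels, because the thresholds behind them are not stated on this page.

```python
# Counts copied from the protocol matrix above; the slice covers 60 papers.
TOTAL_PAPERS = 60

# Hypothetical container name; keys mirror the matrix row labels.
counts = {
    "Human feedback": 17,
    "Quality controls": 3,
    "Benchmarks": 8,
    "Metrics": 18,
    "Known rater population": 4,
    "Known annotation unit": 10,
}


def coverage_share(count: int, total: int = TOTAL_PAPERS) -> float:
    """Return coverage as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


shares = {name: coverage_share(n) for name, n in counts.items()}

# Print ingredients from most to least covered.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share}%")
```

The Benchmarks and Metrics shares computed this way match the 13.3% and 30.0% anchor figures reported at the top of the page.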
Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu · May 28, 2026 · Citations: 0
To evaluate, we introduce LLMScan, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures.
Qinpei Luo, Ruichun Ma, Xinyu Zhang, Lili Qiu · May 28, 2026 · Citations: 0
We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation.
Lukas Aichberger, Sepp Hochreiter · May 28, 2026 · Citations: 0
In contrast, human cognition can use working memory to hold and manipulate information internally without the need to externalize intermediate thoughts.
Anany Kotawala · May 28, 2026 · Citations: 0
Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent.
Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
David Busbib, Michael Werman · May 28, 2026 · Citations: 0
To support this setting, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024–2025.
Felix Zhou, Anay Mehrotra, Quanquan C. Liu · May 28, 2026 · Citations: 0
Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Anany Kotawala · May 28, 2026 · Citations: 0
Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9…
Valentina Bui Muti, Eugénie Dulout, Ziquan Fu · May 28, 2026 · Citations: 0
We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems.
Chen Henry Wu, Aditi Raghunathan · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie · May 28, 2026 · Citations: 0
Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data,…
Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang · May 28, 2026 · Citations: 0
To address this, we propose a human-like long document translation agent called Loong, which leverages a 3E memory module (Essence-Exemplar-Entity) to store summaries, sentence pairs, and entity records as historical context.
Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar, Dong Whi Yoo · May 28, 2026 · Citations: 0
Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data.
Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang · May 28, 2026 · Citations: 0
Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities.
Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang · May 28, 2026 · Citations: 0
While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely…
Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang · May 28, 2026 · Citations: 0
In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents.
Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Shaojie Wang, Liang Zhang · May 28, 2026 · Citations: 0
Experiments across four backbones and five mathematical reasoning benchmarks show that PPC achieves the best results on 39 of 40 metrics, improving maj@16 and pass@16 by +2.23 and +3.06 over the strongest baseline without introducing…
Sahajpreet Singh, Insyirah Mujtahid, Min-Yen Kan, Kokil Jaidka · May 28, 2026 · Citations: 0
Misinformation verification increasingly occurs in public, fast-moving, and multilingual online settings, where static benchmarks provide an incomplete measure of model reliability.
Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo, Kai Liu · May 28, 2026 · Citations: 0
GRASP substantially advances the state of the art on every metric across the three STaRK benchmarks, lifting average Hit@1 from 62.0 to 73.9.
Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya, Aaron Mueller · May 28, 2026 · Citations: 0
Behavioral results inform mechanistic hypotheses, and insights from mechanistic analyses help build stronger behavioral evaluations by predicting failure modes missing from existing evaluations.
Andy Q Han, David J. Chalmers, Pavel Izmailov · May 28, 2026 · Citations: 0
We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals.
Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao · May 28, 2026 · Citations: 0
To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation.
Fabian Mewes, Anne Lauscher, Vagrant Gautam · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Markus Frey, Behzad Shomali, Joachim Koehler, Mehdi Ali · May 28, 2026 · Citations: 0
We show that across two FLOP budgets, our dual-path model surpasses iso-FLOP matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs.
Travis Lelle · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen · May 28, 2026 · Citations: 0
Proactive agents read user activity as text and call an LLM on every event to decide whether to act.
Milan Straka · May 28, 2026 · Citations: 0
Furthermore, we perform a series of ablation experiments with different model sizes, empty node prediction methods, and cross-lingual zero-shot evaluation.
Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari · May 28, 2026 · Citations: 0
Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy…
Songbo Hu, Yinhong Liu, Ej Zhou, Evgeniia Razumovskaia, Xiaobin Wang · May 28, 2026 · Citations: 0
We report benchmark results across key dialogue tasks, which reveal consistent performance disparities across languages, even among high-resource ones.
Jiamin Chen, Yidi Wu, Qiexiang Wang, Qianben Chen, Yuchen Li · May 28, 2026 · Citations: 0
Therefore, we present Seeded Elimination with Adaptive LLM-as-a-Meta-Judge, a self-improving evaluation protocol for extracting latent ranking signal from saturated benchmarks.
Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li · May 28, 2026 · Citations: 0
However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and…
Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Thang Dang, Akira Nakagawa, Kenichi Kobayashi, Koichi Shirahata · May 28, 2026 · Citations: 0
Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels.
Yingdong Shi, Ruiming Zhang, Changming Li, Zhiyu Yang, Kaixing Zhang · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Weihan Peng, Chenxu Zhang, Qianao Wang, Yuling Shi, Heng Lian · May 28, 2026 · Citations: 0
While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance.
Parsa Mazaheri · May 28, 2026 · Citations: 0
On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback -- showing that checkpoint information, not the…
Zhangqi Duan, Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Simon Woodhead · May 28, 2026 · Citations: 0
A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training.
Shahinul Hoque, Jinghuai Zhang, Jinyuan Sun, Fnu Suya · May 28, 2026 · Citations: 0
We show that this kind of billing is hard to audit by design: providers hide the model, the tokenizer, and the execution to protect their IP, mitigate jailbreaks, and preserve user privacy, which means an auditor can only inspect proofs the…
Asaf Yehudai, Naama Rozen, Ariel Gera · May 28, 2026 · Citations: 0
Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure.
Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang, Yun-Nung Chen · May 28, 2026 · Citations: 0
Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility.
Pierre-Antoine Lequeu, Camille Barboule, Benjamin Piwowarski · May 28, 2026 · Citations: 0
The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.
Vinay Samuel, Yapei Chang, Mohit Iyyer · May 28, 2026 · Citations: 0
For each prompt, REDIPO samples responses from both base and instruct models, rewrites base-model responses with the instruct model, filters candidates for safety and instruction-following quality, and builds preference pairs that favor…
Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti, Amlan Chakrabarti · May 28, 2026 · Citations: 0
Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities.
M. Ali Bayram, Banu Diri, Savaş Yıldırım · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Dang Hong Nguyen, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Zhenghao Herbert Zhou, R. Thomas McCoy, Robert Frank · May 28, 2026 · Citations: 0
We show that verb bias is causally represented in steering vectors extracted from large language models: counterfactual edits to verb bias systematically shift downstream structural preferences.
Anisha Saha, Varsha Suresh, Teodora Kamova, Sophia Wiedmann, Timothy Hospedales · May 28, 2026 · Citations: 0
Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.
Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim · May 28, 2026 · Citations: 0
To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation in agent performance.
Christoph Leiter, Haiyue Song, Hour Kaing, Jin Tei, Hideki Tanaka · May 28, 2026 · Citations: 0
To address this gap, we introduce ExCAM, an Explainable Cultural Awareness Metric, which is, to our knowledge, the first dedicated evaluation metric that identifies, rates and explains cultural errors in instruction-output pairs.
David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, Shlomo Berkovsky · May 28, 2026 · Citations: 0
Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs for constrained multiple-choice output, yet the same cases score differently with free-text.
Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu, Qingyun Sun · May 28, 2026 · Citations: 0
To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment…
Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Zhicheng Dou · May 28, 2026 · Citations: 0
Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports.
Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng · May 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Xin Guan, Xiaomeng Hu, Shen Huang, Zhenyi Wang, Bo Zhang · May 28, 2026 · Citations: 0
Current rubric-based RL methods mitigate this by employing explicit criteria; however, they rely heavily on static, human-annotated rubrics that inevitably cause policy lag, or expensive external proprietary models for dynamic updates.
Leijiang Gu, Zhen Zeng, Feng Li, Xinjian Gao, Zenglin Shi · May 28, 2026 · Citations: 0
Extensive experiments across various benchmarks and MLLMs demonstrate that LDKE achieves superior performance in propagating edits to related contexts while maintaining high locality.
Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, Tomasz Jan Kajdanowicz · May 28, 2026 · Citations: 0
Yet, it remains unknown whether LLMs engage with scientific manuscripts in the same manner as human reviewers, or whether they merely produce review-looking text.
Egor Shevchenko, Elena Bruches · May 28, 2026 · Citations: 0
Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization.