HFEPX Archive Slice
HFEPX Daily Papers for 2026-06-17
Daily archive slice for 2026-06-17 from the HFEPX corpus. Updated from the current HFEPX corpus (2026-06-22); covers 60 papers from 2026-06-17.
Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.
High-Signal Coverage
100.0%
60 / 60 papers are not low-signal flagged.
Benchmark Anchors
11.7%
Papers with benchmark/dataset mentions in extraction output.
Metric Anchors
15.0%
Papers with reported metric mentions in extraction output.
Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.
Ranked by protocol completeness and evidence density for faster period-over-period review.
Jun 17, 2026 · Citations: 0 · Score: 8.5
Eval: LLM-as-Judge, Automatic Metrics · Metrics: Exact match, Kappa
Jun 17, 2026 · Citations: 0 · Score: 7.5
Eval: Automatic Metrics · Metrics: Accuracy, Precision
Jun 17, 2026 · Citations: 0 · Score: 6.5
Eval: Simulation Env · Metrics: Success rate
Jun 17, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics, Simulation Env · Metrics: F1, Latency
Jun 17, 2026 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Agreement
Jun 17, 2026 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: Success rate, Latency
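The ordering of the entries above can be reproduced from the listed scores alone. A minimal sketch in Python, assuming only the score values shown on this page; how HFEPX derives a score from protocol completeness and evidence density is not stated here, and the `Entry` type and `rank` function are hypothetical names used for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    eval_modes: str  # evaluation modes as listed on the page
    score: float     # score as listed on the page


# Entries in arbitrary input order (values copied from the listing above).
entries = [
    Entry("Automatic Metrics (Agreement)", 6.0),
    Entry("Simulation Env", 6.5),
    Entry("LLM-as-Judge, Automatic Metrics", 8.5),
    Entry("Automatic Metrics, Simulation Env", 6.5),
    Entry("Automatic Metrics (Success rate, Latency)", 5.0),
    Entry("Automatic Metrics (Accuracy, Precision)", 7.5),
]


def rank(items):
    """Sort by score, highest first. sorted() is stable, so entries
    with equal scores keep their relative input order."""
    return sorted(items, key=lambda e: e.score, reverse=True)


for position, entry in enumerate(rank(entries), start=1):
    print(f"{position}. score={entry.score:.1f}  eval={entry.eval_modes}")
```

Because the sort is stable, the two entries tied at 6.5 stay in whatever order they arrived in; a real ranking would need an explicit tie-breaker if that order matters.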
Quickly compare method ingredients across this archive slice.
Gap: Human feedback
Human feedback is present in 2 of 60 papers.
Gap: Quality controls
Quality controls are present in 3 of 60 papers.
Gap: Benchmarks
Benchmarks are present in 7 of 60 papers.
Gap: Metrics
Metrics are present in 9 of 60 papers.
Gap: Known rater population
Known rater population is present in 2 of 60 papers.
Gap: Known annotation unit
Known annotation unit is present in 5 of 60 papers.
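The gap counts above and the headline percentages on this page are simple ratios over the 60 papers in the slice. A minimal sketch in Python, assuming the counts listed on this page; the function name `coverage_pct` and the dictionary layout are illustrative and not part of HFEPX:

```python
# Counts copied from this page; names and layout are illustrative assumptions.
TOTAL_PAPERS = 60

gap_counts = {
    "Human feedback": 2,
    "Quality controls": 3,
    "Benchmarks": 7,
    "Metrics": 9,
    "Known rater population": 2,
    "Known annotation unit": 5,
}


def coverage_pct(count: int, total: int = TOTAL_PAPERS) -> str:
    """Return the share of papers as a percentage string with one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{100 * count / total:.1f}%"


for name, count in gap_counts.items():
    print(f"{name}: {count} of {TOTAL_PAPERS} ({coverage_pct(count)})")
```

With these counts, Benchmarks (7 of 60) and Metrics (9 of 60) give 11.7% and 15.0%, matching the Benchmark Anchors and Metric Anchors figures reported at the top of the page.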
Panels: Evaluation Modes · Top Metrics · Top Benchmarks · Quality Controls
Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen · Jun 17, 2026 · Citations: 0
The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication.
Pierre Dantas, Lucas Cordeiro, Waldir Junior · Jun 17, 2026 · Citations: 0
The 11 textual LD benchmarks are fully preserved, with no regression.
Gulshan Saleem, Nisar Ahmed, Muhammad Imran Zaman, Ali Hassan · Jun 17, 2026 · Citations: 0
Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 percentage points and a published…
Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng · Jun 17, 2026 · Citations: 0
On-policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training.
Vinicius Covas · Jun 17, 2026 · Citations: 0
The study contributes a multilingual corpus in Portuguese, Spanish, English, and French; a nine-frame narrative taxonomy with cue-based frame annotation; a reproducible annotation pipeline combining LLM-assisted suggestion with human…
Yunkai Xu, Saeed Abdullah · Jun 17, 2026 · Citations: 0
LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models.
David M. Smiley · Jun 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Priyanshi Garg, Ishita Rao, Jieqiong Ding, Amandalynne Paullada · Jun 17, 2026 · Citations: 0
We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume…
Antonio de Sousa Leitão Filho, Allan Kardec Duailibe Barros Filho, Fabrício Saul Lima, Selby Mykael Lima dos Santos, Rejani Bandeira Vieira Sousa · Jun 17, 2026 · Citations: 0
Intrinsic evaluation covers four properties verifiable by construction -- ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction -- over an internal, physically validated benchmark (EngQuant,…
Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla · Jun 17, 2026 · Citations: 0
Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has…
Vu Nguyen Nguyen Xuan, Huy Ngo Quang · Jun 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Gregory Matsnev · Jun 17, 2026 · Citations: 0
Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty…
Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos · Jun 17, 2026 · Citations: 0
Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality.
Lee Sangmyeong, Shun Inadumi, Koichiro Yoshino · Jun 17, 2026 · Citations: 0
We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity leveraging visual scenes.
Justin D. Norman, Michael U. Rivera, D. Alex Hughes · Jun 17, 2026 · Citations: 0
We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and…
Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang · Jun 17, 2026 · Citations: 0
To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per…
Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He · Jun 17, 2026 · Citations: 0
Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi · Jun 17, 2026 · Citations: 0
Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak · Jun 17, 2026 · Citations: 0
Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma · Jun 17, 2026 · Citations: 0
Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li · Jun 17, 2026 · Citations: 0
Denis Peskoff, Joe Barrow, Christopher Vu, Diag Davenport · Jun 17, 2026 · Citations: 0
Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying · Jun 17, 2026 · Citations: 0
Leyang Shen, Yang Zhang, Xiaoyan Zhao, Chun Kai Ling, Tat-Seng Chua · Jun 17, 2026 · Citations: 0
Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour · Jun 17, 2026 · Citations: 0
Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer · Jun 17, 2026 · Citations: 0
Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao · Jun 17, 2026 · Citations: 0
Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng · Jun 17, 2026 · Citations: 0
Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee · Jun 17, 2026 · Citations: 0
Soheyl Bateni, Maryam Abdolali · Jun 17, 2026 · Citations: 0
Shiho Matta, Yin Jou Huang, Fei Cheng, Takashi Kodama, Hirokazu Kiyomaru · Jun 17, 2026 · Citations: 0
Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari · Jun 17, 2026 · Citations: 0
Jingyi Zhou, Senlin Luo, Haofan Chen · Jun 17, 2026 · Citations: 0
Ramza Basharat, Muhammad Usman Ali · Jun 17, 2026 · Citations: 0
Hui Zhang, Shuren Song · Jun 17, 2026 · Citations: 0
Haewoon Kwak · Jun 17, 2026 · Citations: 0
Qiuyu Fang, Jiayi Hao, Chengzhi Zhang · Jun 17, 2026 · Citations: 0
Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi · Jun 17, 2026 · Citations: 0
Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen · Jun 17, 2026 · Citations: 0
Fengying Ye, Yanming Sun, Runzhe Zhan, Zheqi Zhang, Lidia S. Chao · Jun 17, 2026 · Citations: 0
Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le · Jun 17, 2026 · Citations: 0
Franziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher · Jun 17, 2026 · Citations: 0
Salim Khazem · Jun 17, 2026 · Citations: 0
Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai · Jun 17, 2026 · Citations: 0
Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das · Jun 17, 2026 · Citations: 0
Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei · Jun 17, 2026 · Citations: 0
Jasmine Owers, Edwin Simpson, Martha Lewis · Jun 17, 2026 · Citations: 0
Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin · Jun 17, 2026 · Citations: 0
Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen · Jun 17, 2026 · Citations: 0
Zhuangzhuang Pan, Ning Dong, Yingna Su, Yan Xia · Jun 17, 2026 · Citations: 0
Adrian Cosma, Nicoleta-Nina Basoc, Andrei Niculae, Cosmin Dumitrache, Emilian Radoi · Jun 17, 2026 · Citations: 0
Wen-Fong Huang, Edwin Simpson · Jun 17, 2026 · Citations: 0
Nicolas Floquet, Joseph Le Roux, Nadi Tomeh · Jun 17, 2026 · Citations: 0
Wicaksono Leksono Muhamad, Yunita Sari · Jun 17, 2026 · Citations: 0
Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu · Jun 17, 2026 · Citations: 0
Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao · Jun 17, 2026 · Citations: 0
Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu · Jun 17, 2026 · Citations: 0
Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu · Jun 17, 2026 · Citations: 0
Jaward Sesay, Yue Yu, Börje F. Karlsson · Jun 17, 2026 · Citations: 0
Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar, Madhav Aggarwal · Jun 17, 2026 · Citations: 0