HFEPX Archive Slice
HFEPX Daily Papers for 2026-06-18
Daily archive slice for 2026-06-18 from the HFEPX corpus. Updated from current HFEPX corpus (2026-06-20); covers 57 papers from 2026-06-18.
Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.
High-Signal Coverage
100.0%
57 / 57 papers are not low-signal flagged.
Benchmark Anchors
21.1%
Papers with benchmark/dataset mentions in extraction output.
Metric Anchors
52.6%
Papers with reported metric mentions in extraction output.
Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.
Ranked by protocol completeness and evidence density for faster period-over-period review.
Jun 18, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Cost, Token cost
Jun 18, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy, Exact match
Jun 18, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy, Perplexity
Jun 18, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy, F1
Jun 18, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy, Cost
Jun 18, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy
Quickly compare method ingredients across this archive slice.
| Paper | Eval Modes | Benchmarks | Metrics | Quality Controls |
|---|---|---|---|---|
| Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems Jun 18, 2026 | Automatic Metrics | Herabench | Cost, Token cost | Not reported |
| Source-Grounded Data Generation for Text-to-JSON Learning Jun 18, 2026 | Automatic Metrics | STAGE-Eval | Accuracy, Exact match | Not reported |
| GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs Jun 18, 2026 | Automatic Metrics | GSM8K | Accuracy, Perplexity | Not reported |
| CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis Jun 18, 2026 | Automatic Metrics | Wikisplitbench, Claimdecompbench | Accuracy, F1 | Not reported |
| Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning Jun 18, 2026 | Automatic Metrics | CommonsenseQA | Accuracy, Cost | Not reported |
| AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA Jun 18, 2026 | Automatic Metrics | ChartQA | Accuracy | Not reported |
| NEST: Narrative Event Structures in Time for Long Video Understanding Jun 18, 2026 | Automatic Metrics | Needle In A Haystack | F1 | Not reported |
| Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users Jun 18, 2026 | Automatic Metrics | Not reported | Accuracy | Not reported |
| The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse Jun 18, 2026 | Automatic Metrics | Not reported | Accuracy | Calibration |
| Benchmarking Agentic Review Systems Jun 18, 2026 | Automatic Metrics | Not reported | Accuracy, Recall | Not reported |
Gap: Human feedback
Human feedback is present in 10 of 57 papers.
Gap: Quality controls
Quality controls are present in 2 of 57 papers.
Gap: Benchmarks
Benchmarks are present in 12 of 57 papers.
Strong: Metrics
Metrics are present in 30 of 57 papers.
Gap: Known rater population
Known rater population is present in 2 of 57 papers.
Moderate: Known annotation unit
Known annotation unit is present in 12 of 57 papers.
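The headline percentages on this page follow from the per-field counts above (for example, 12 of 57 papers with benchmarks gives 21.1%). A minimal sketch of that arithmetic is below; the names `counts` and `coverage_pct` are illustrative and not part of HFEPX, and rounding to one decimal place is an assumption based on how the figures are displayed. The Gap / Moderate / Strong labels are not derived here because the page does not state the thresholds behind them.

```python
TOTAL_PAPERS = 57  # papers in this slice, as reported on the page

# Per-field counts copied from the coverage lines above.
counts = {
    "Human feedback": 10,
    "Quality controls": 2,
    "Benchmarks": 12,
    "Metrics": 30,
    "Known rater population": 2,
    "Known annotation unit": 12,
}


def coverage_pct(count, total=TOTAL_PAPERS):
    """Return coverage as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Hypothetical helper output: field name -> percentage of the slice.
coverage = {field: coverage_pct(n) for field, n in counts.items()}

# Print fields from highest to lowest coverage.
for field, pct in sorted(coverage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{field}: {counts[field]}/{TOTAL_PAPERS} = {pct}%")
```

Computing the percentages from the raw counts makes it easier to compare slices of different sizes period over period.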
Charts: Evaluation Modes · Top Metrics · Top Benchmarks · Quality Controls
Md Nayem Uddin, Amir Saeidi, Eduardo Blanco, Chitta Baral · Jun 18, 2026 · Citations: 0
Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies.
Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner · Jun 18, 2026 · Citations: 0
Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood.
Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu · Jun 18, 2026 · Citations: 0
We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API--CLI--GUI execution.
Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani · Jun 18, 2026 · Citations: 0
To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text.
Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter · Jun 18, 2026 · Citations: 0
On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs.
Helena Bonaldi, Genoveffa Martone, Marco Guerini · Jun 18, 2026 · Citations: 0
While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer…
Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng · Jun 18, 2026 · Citations: 0
PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric…
Celestine Achi · Jun 18, 2026 · Citations: 0
We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent.
Abdul Rafay Syed · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Josef Jon, Ondřej Bojar · Jun 18, 2026 · Citations: 0
The dataset is designed to support the evaluation of machine translation systems that aim to preserve document formatting during translation.
Jelena Meyer, David Garcia, Dirk U. Wulff · Jun 18, 2026 · Citations: 0
Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research.
Augustin Bouquillard, Florent Jacquemard · Jun 18, 2026 · Citations: 0
We present evaluations conducted on datasets comprising a variety of digital musical scores: jazz lead sheets taken from the Real Book, transcriptions of recordings of jazz soli and bass lines, traditional tunes, as well as classical scores…
Maxim Melichov, Yakov Kolani, Morris Alper · Jun 18, 2026 · Citations: 0
Results on existing Hebrew G2P benchmarks and the new targeted MILIM benchmark for spoken Hebrew show that ReNikud surpasses previous state-of-the-art methods.
Aueaphum Aueawatthanaphisut · Jun 18, 2026 · Citations: 0
The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning.
Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor · Jun 18, 2026 · Citations: 0
To benchmark this task, we present the NAMESAKES dataset of over one thousand names and faces of public figures spanning a wide range of fame levels, along with perturbed, less famous names.
Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo · Jun 18, 2026 · Citations: 0
Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood.
Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer · Jun 18, 2026 · Citations: 0
The simulation benchmark shows the router outperforming two static baselines (0.694 vs.
Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu · Jun 18, 2026 · Citations: 0
Further, PASQA shows stronger agreement with human accent-correctness judgments.
Elroy Galbraith · Jun 18, 2026 · Citations: 0
On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound H on the portion of tool latency that can be hidden behind the user's remaining input, as a function of…
Zhentao Tan, Wei Chen, Jingyi Shen, Yao Liu, Xu Shen · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
William Guey, Pierrick Bougault · Jun 18, 2026 · Citations: 0
Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts…
Arash Ghafouri, Mahdi Firouzmandi, Hossein Saberi, Mohammad Reza Hasani Ahangar · Jun 18, 2026 · Citations: 0
Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks.
Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li · Jun 18, 2026 · Citations: 0
To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps.
Sunghee Ahn, Guijin Son, Youngjae Yu · Jun 18, 2026 · Citations: 0
Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches.
Pratyush Kumar · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou · Jun 18, 2026 · Citations: 0
As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant.
Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li · Jun 18, 2026 · Citations: 0
This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long…
Yongqi Shao, Hong Huo, Flavio Bertini, Danilo Montesi, Tao Fang · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata · Jun 18, 2026 · Citations: 0
We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics.
Yu Deng · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh · Jun 18, 2026 · Citations: 0
The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations.
Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury · Jun 18, 2026 · Citations: 0
Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
Guneesh Vats, Anubha Agrawal, Shikha Singhal, Ajita Dash, Praison Selvaraj · Jun 18, 2026 · Citations: 0
We present REDACT, a systematically controlled multilingual PII benchmark with 13,427 records, 324,078 entity annotations, 51 entity types, 4,127 surface-form patterns, and 25 languages across 9 scripts.
Serge Sharoff · Jun 18, 2026 · Citations: 0
The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation.
Jiayi Zhu, Haoxuan Peng, Junxi Wang, Liang Ke, Chen Zhang · Jun 18, 2026 · Citations: 0
Large language models (LLMs) are commonly prompted and interfaced with human-readable natural language, even when the intended reader is another model.
Aman Pathak, Cheng Peng, Mengxian Lyu, Ziyi Chen, Reema Solan · Jun 18, 2026 · Citations: 0
In this study, we developed a zero-shot, agentic workflow, and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports.
Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu · Jun 18, 2026 · Citations: 0
Experiments on the LoCoMo benchmark confirm that AtomMem achieves state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.
Hongliang Liu · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Jianwen Sun, Chuanhao Li, Zizhen Li, Yukang Feng, Fanrui Zhang · Jun 18, 2026 · Citations: 0
We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine.
Phuong Huu Vu Tran, Thuan Duc Mai, Bach Xuan Le · Jun 18, 2026 · Citations: 0
We present Credence, a revised claim decomposition and evaluation framework addressing both shortcomings.
Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang · Jun 18, 2026 · Citations: 0
We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models.
Aravind Narayanan, Shaina Raza · Jun 18, 2026 · Citations: 0
Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise.
Darrien McKenzie, Nicklas Hansen, Xiaolong Wang · Jun 18, 2026 · Citations: 0
Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance).
Dang Nguyen, Wanqing Hao, Yanai Elazar, Chenhao Tan · Jun 18, 2026 · Citations: 0
A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated.
Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim · Jun 18, 2026 · Citations: 0
Aligning language models with human preferences often requires optimising multiple behavioural objectives.
Punit Kumar Singh, Niladri Ghosh, Advait Joshi, Shailee Choudhary, Michael Färber · Jun 18, 2026 · Citations: 0
To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions.
Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Elijah Feldman, Dipak Meher, Carlotta Domeniconi · Jun 18, 2026 · Citations: 0
Court proceedings contain valuable evidence about human smuggling networks, but this information is often buried within unstructured, jargon-heavy legal documents.
Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang · Jun 18, 2026 · Citations: 0
Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress, for example, whether a model can connect an early…
Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska · Jun 18, 2026 · Citations: 0
Researchers are interested in learning about Mars so that it may eventually become habitable for humans.
Jason Potteiger · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Kaizhen Tan, Rong Gu, Mingyuan Li · Jun 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.