HFEPX Archive Slice

HFEPX Fortnight Archive: 2025-F13

Updated from current HFEPX corpus (Apr 17, 2026). 44 papers are grouped in this fortnight archive page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Adjudication. Frequently cited benchmark: DROP. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Jun 26, 2025.

Papers: 44 · Last published: Jun 26, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

High-Signal Coverage: 100.0% (44 / 44 papers are not low-signal flagged).
Benchmark Anchors: 13.6% (papers with benchmark/dataset mentions in extraction output).
Metric Anchors: 31.8% (papers with reported metric mentions in extraction output).

1 paper reports explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Why This Time Slice Matters

13.6% of papers report explicit human-feedback signals, led by expert verification.
Automatic metrics appear in 29.5% of papers in this hub.
DROP is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

Most common quality-control signal is adjudication (2.3% of papers).
Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Jun 25, 2025 · Citations: 0 · Score: 5.5
Eval: Automatic Metrics · Metrics: Recall, Agreement

LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
Jun 23, 2025 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: Coherence

PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Jun 20, 2025 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: Accuracy

SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Jun 18, 2025 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: Accuracy, Precision

AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jun 17, 2025 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: Cost

DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries
Jun 20, 2025 · Citations: 0 · Score: 4.5
Eval: LLM-as-Judge, Automatic Metrics · Metrics: AUROC

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Published	Eval Modes	Benchmarks	Metrics	Quality Controls
An Agentic System for Rare Disease Diagnosis with Traceable Reasoning	Jun 25, 2025	Automatic Metrics	Not reported	Recall, Agreement	Adjudication
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning	Jun 23, 2025	Automatic Metrics	LMSYS Chatbot Arena, WritingBench	Coherence	Not reported
PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents	Jun 20, 2025	Automatic Metrics	HotpotQA, TriviaQA	Accuracy	Not reported
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling	Jun 18, 2025	Automatic Metrics	GSM8K, ProcessBench	Accuracy, Precision	Not reported
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents	Jun 17, 2025	Automatic Metrics	DROP	Cost	Not reported
DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries	Jun 20, 2025	LLM-as-Judge, Automatic Metrics	Not reported	AUROC	Not reported
MindCube: Spatial Mental Modeling from Limited Views	Jun 26, 2025	Automatic Metrics, Simulation Env	Not reported	Accuracy	Not reported
Complexity-aware fine-tuning	Jun 26, 2025	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems	Jun 24, 2025	Automatic Metrics	Not reported	Spearman	Not reported
Long-Context Generalization with Sparse Attention	Jun 19, 2025	Automatic Metrics	Not reported	Perplexity	Not reported
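
The triage advice to prioritize papers with both benchmark and metric anchors can be applied to the matrix mechanically. The following is a minimal Python sketch, not part of HFEPX itself: the `ROWS` list is hand-copied from the Benchmarks and Metrics cells of the matrix, the short paper labels are shorthand for the full titles, and the helper names `parse_cell` and `has_both_anchors` are hypothetical.

```python
# Hypothetical helper names; cell values are hand-copied from the matrix above.

def parse_cell(cell: str) -> list[str]:
    """Split a comma-separated matrix cell; 'Not reported' becomes an empty list."""
    cell = cell.strip()
    if cell == "Not reported":
        return []
    return [item.strip() for item in cell.split(",") if item.strip()]

# (short paper label, Benchmarks cell, Metrics cell)
ROWS = [
    ("An Agentic System for Rare Disease Diagnosis", "Not reported", "Recall, Agreement"),
    ("LongWriter-Zero", "LMSYS Chatbot Arena, Writingbench", "Coherence"),
    ("PersonalAI", "HotpotQA, TriviaQA", "Accuracy"),
    ("SPARE", "GSM8K, Processbench", "Accuracy, Precision"),
    ("AgentSynth", "DROP", "Cost"),
    ("DistillNote", "Not reported", "Auroc"),
    ("MindCube", "Not reported", "Accuracy"),
    ("Complexity-aware fine-tuning", "Not reported", "Accuracy, Cost"),
    ("TTSDS2", "Not reported", "Spearman"),
    ("Long-Context Generalization with Sparse Attention", "Not reported", "Perplexity"),
]

def has_both_anchors(benchmarks_cell: str, metrics_cell: str) -> bool:
    """True when a row names at least one benchmark and at least one metric."""
    return bool(parse_cell(benchmarks_cell)) and bool(parse_cell(metrics_cell))

# Papers usable for longitudinal comparison under the page's own criterion.
anchored = [paper for paper, bench, metric in ROWS if has_both_anchors(bench, metric)]

for paper in anchored:
    print(paper)
```

Under these assumptions, only the rows that name a benchmark survive the filter, since every row in the matrix already reports at least one metric.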

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback. Coverage is a replication risk (13.6% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (2.3% vs 30% target).
Gap: Papers naming benchmarks/datasets. Coverage is a replication risk (4.5% vs 35% target).
Gap: Papers naming evaluation metrics. Coverage is a replication risk (18.2% vs 35% target).
Gap: Papers with known rater population. Coverage is a replication risk (6.8% vs 35% target).
Gap: Papers with known annotation unit. Coverage is a replication risk (6.8% vs 35% target).
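
To make the checklist gaps concrete, the coverage and target percentages can be turned into a shortfall in percentage points and an approximate number of additional papers needed. This is a rough Python sketch under stated assumptions: the percentages and the slice size of 44 come from this page, while the function names and the rounding rule (round current coverage to the nearest paper, round the target up) are illustrative choices, not HFEPX's definition.

```python
import math

TOTAL_PAPERS = 44  # slice size stated on this page

# Checklist item -> (coverage %, target %), copied from the checklist.
CHECKLIST = {
    "explicit human feedback": (13.6, 45.0),
    "quality controls": (2.3, 30.0),
    "benchmarks/datasets": (4.5, 35.0),
    "evaluation metrics": (18.2, 35.0),
    "known rater population": (6.8, 35.0),
    "known annotation unit": (6.8, 35.0),
}

def shortfall(coverage_pct: float, target_pct: float) -> float:
    """Gap to target in percentage points (0 when the target is already met)."""
    return round(max(target_pct - coverage_pct, 0.0), 1)

def papers_needed(coverage_pct: float, target_pct: float, total: int = TOTAL_PAPERS) -> int:
    """Approximate number of extra papers required to reach the target.

    Assumption: current coverage is rounded to the nearest whole paper and
    the target is rounded up to the next whole paper.
    """
    current = round(coverage_pct / 100 * total)
    required = math.ceil(target_pct / 100 * total)
    return max(required - current, 0)

# Rank checklist items by how far they sit below target.
ranked = sorted(CHECKLIST, key=lambda item: shortfall(*CHECKLIST[item]), reverse=True)

for item in ranked:
    cov, tgt = CHECKLIST[item]
    print(f"{item}: {shortfall(cov, tgt)} points below target, "
          f"about {papers_needed(cov, tgt)} more papers needed")
```

Sorting by shortfall gives a simple way to decide which gap to address first when planning a replication.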

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 2.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6.8% coverage).
Annotation unit is under-specified (6.8% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (DROP vs GSM8K) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.
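
For the inter-annotator agreement check suggested above, one common choice for two raters is Cohen's kappa. The sketch below is a plain-Python illustration, assuming two raters who each assign one categorical label per item; the function name and the example labels are hypothetical and not drawn from any paper in this slice.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical example: two raters pick the preferred response (A or B) for four items.
rater_1 = ["A", "A", "B", "B"]
rater_2 = ["A", "A", "B", "A"]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

In this example the raters agree on 3 of 4 items (observed agreement 0.75) while chance agreement is 0.5, so kappa is 0.5. For ranking annotation with more than two raters, a different statistic such as Krippendorff's alpha or Kendall's W may be more appropriate.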

Recommended Queries

LLM-as-Judge Protocols
Benchmark Slice: DROP
Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (13)
Simulation Env (2)
LLM-as-Judge (1)

Top Metrics

Accuracy (5)
Cost (2)
Recall (2)
Agreement (1)

Top Benchmarks

DROP (1)
GSM8K (1)
ProcessBench (1)

Quality Controls

Adjudication (1)

Papers In This Archive Slice

Theory of Mind in Action: The Instruction Inference Task in Dynamic Human-Agent Collaboration
Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh · Jun 26, 2025 · Citations: 0
MindCube: Spatial Mental Modeling from Limited Views
Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang · Jun 26, 2025 · Citations: 0

Can Vision-Language Models (VLMs) imagine the full scene from just a few views, like humans do?
Complexity-aware fine-tuning
Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev · Jun 26, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Multi-lingual Functional Evaluation for Large Language Models
Victor Ojewale, Inioluwa Deborah Raji, Suresh Venkatasubramanian · Jun 25, 2025 · Citations: 0

Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM.
Cognitive models can reveal interpretable value trade-offs in language models
Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier · Jun 25, 2025 · Citations: 0
π-CoT: Prolog-Initialized Chain-of-Thought Prompting for Multi-Hop Question-Answering
Chao Wan, Albert Gong, Mihir Mishra, Carl-Leander Henneking, Claas Beger · Jun 25, 2025 · Citations: 0

Extensive experiments demonstrate that π-CoT significantly outperforms standard RAG and in-context CoT on multi-hop question-answering benchmarks.
An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu · Jun 25, 2025 · Citations: 0

Expert Verification · Multi Agent

Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources.
Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder
Yingji Zhang, Danilo S. Carvalho, André Freitas · Jun 25, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
Baochang Ren, Shuofei Qiao, Da Zheng, Huajun Chen, Ningyu Zhang · Jun 24, 2025 · Citations: 0

Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning…
Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?
Chuxuan Hu, Yuxuan Zhu, Antony Kellermann, Caleb Biddulph, Suppakit Waiwitlikhit · Jun 24, 2025 · Citations: 0
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
Christoph Minixhofer, Ondrej Klejch, Peter Bell · Jun 24, 2025 · Citations: 0

Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive.
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li · Jun 23, 2025 · Citations: 0

Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and…
Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktäschel · Jun 23, 2025 · Citations: 0

Demonstrations

Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily 'programmed' into LLMs through PBB, with important…
Context Biasing for Pronunciation-Orthography Mismatch in Automatic Speech Recognition
Christian Huber, Alexander Waibel · Jun 23, 2025 · Citations: 0
Parallel Continuous Chain-of-Thought with Jacobi Iteration
Haoyi Wu, Zhihao Teng, Kewei Tu · Jun 23, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models
Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang · Jun 23, 2025 · Citations: 0

This motivates us to explore if large reasoning models can benefit from a motivation of the task, i.e., awareness of the reward function, during the reinforcement finetuning process, as we humans sometimes do when learning.
Improving Black-Box Generative Attacks via Generator Semantic Consistency
Jongoh Jeong, Hunmin Yang, Jaeseok Jeong, Kuk-Jin Yoon · Jun 23, 2025 · Citations: 0
PDF Retrieval Augmented Question Answering
Thi Thu Uyen Hoang, Viet Anh Nguyen · Jun 22, 2025 · Citations: 0

We provide an in-depth experimental evaluation of our solution, demonstrating its capability to extract accurate information that can be applied to different types of content across PDFs.
LLM Probability Concentration: How Alignment Shrinks the Generative Horizon
Chenghao Yang, Sida Li, Ari Holtzman · Jun 22, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Accelerating Residual Reinforcement Learning with Uncertainty Estimation
Lakshita Dodeja, Karl Schmeckpeper, Shivam Vats, Thomas Weng, Mingxi Jia · Jun 21, 2025 · Citations: 0
Towards AI Search Paradigm
Yuchen Li, Hengyi Cai, Rui Kong, Xinran Chen, Jiamin Chen · Jun 20, 2025 · Citations: 0

In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making.
PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025 · Citations: 0

We evaluate our system on three benchmarks: TriviaQA, HotpotQA, DiaASQ and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.
Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning
Jiaqi Chen, Mingfeng Fan, Xuefeng Zhang, Jingsong Liang, Yuhong Cao · Jun 20, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries
Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto · Jun 20, 2025 · Citations: 0

Expert Verification

This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much…
Long-Context Generalization with Sparse Attention
Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso · Jun 19, 2025 · Citations: 0

Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature α-entmax baselines, achieving up to 1000× length extrapolation on…
A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives
Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He · Jun 19, 2025 · Citations: 0

Evaluations were heterogeneous: intrinsic metrics (27.1%), human-in-the-loop assessments (44.1%), and LLM-based evaluations (13.6%).
Measuring Intent Comprehension in LLMs
Nadav Kunievsky, James A. Evans · Jun 19, 2025 · Citations: 0

People judge interactions with large language models (LLMs) as successful when outputs match what they want, not what they type.
Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang · Jun 19, 2025 · Citations: 0

We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones.
When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework
Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun · Jun 19, 2025 · Citations: 0
OJBench: A Competition Level Code Benchmark For Large Language Models
Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao · Jun 19, 2025 · Citations: 0
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu · Jun 18, 2025 · Citations: 0
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych · Jun 18, 2025 · Citations: 0

Long Horizon

To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determine…
DeVisE: Behavioral Testing of Medical Large Language Models
Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto · Jun 18, 2025 · Citations: 0

Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations.
ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut · Jun 18, 2025 · Citations: 0
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan · Jun 18, 2025 · Citations: 0

Red Team

As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards.
BMFM-RNA: whole-cell expression decoding improves transcriptomic foundation models
Michael M. Danziger, Bharath Dandala, Viatcheslav Gurev, Matthew Madgwick, Sivan Ravid · Jun 17, 2025 · Citations: 0

Models trained with these objectives achieve best overall performance across CZI benchmarks, on zero-shot batch integration and linear probing cell-type annotation.
LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo · Jun 17, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
RedTopic: Toward Topic-Diverse Red Teaming of Large Language Models
Jiale Ding, Xiang Zheng, Yutao Wu, Cong Wang, Wei-Bin Lee · Jun 17, 2025 · Citations: 0

Red Team

It tests LLMs with adversarial prompts to uncover vulnerabilities and improve safety alignment.
Hope Speech Detection in code-mixed Roman Urdu tweets: A Positive Turn in Natural Language Processing
Muhammad Ahmad, Muhammad Waqas, Ameer Hamza, Ildar Batyrshin, Grigori Sidorov · Jun 17, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · Jun 17, 2025 · Citations: 0

Long Horizon

We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
Instruction Following by Principled Boosting Attention of Large Language Models
Vitoria Guardieiro, Avishree Khare, Adam Stein, Eric Wong · Jun 16, 2025 · Citations: 0

Yet in practice these constraints can be violated under long contexts or when user-provided context conflicts with them, creating reliability and safety risks.
Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning
David Bani-Harouni, Chantal Pellegrini, Ege Özsoy, Nassir Navab, Matthias Keicher · Jun 16, 2025 · Citations: 0

Expert Verification

In contrast to this, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis via repeatedly requesting and interpreting relevant tests.
DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation
Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang · Jun 16, 2025 · Citations: 0

Safety-aligned large language models (LLMs) remain vulnerable to backdoor attacks.
Dynamic Reinsurance Treaty Bidding via Multi-Agent Reinforcement Learning
Stella C. Dong, James R. Finlay · Jun 16, 2025 · Citations: 0

Multi Agent

This paper develops a novel multi-agent reinforcement learning (MARL) framework for reinsurance treaty bidding, addressing long-standing inefficiencies in traditional broker-mediated placement processes.
