- Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Human EvalAutomatic Metrics
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
- PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0
Expert Verification Llm As JudgeAutomatic Metrics
In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
- TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0
Red Team Automatic Metrics Long Horizon
As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
- Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching
Yicheng Pan, Zhiyuan Ning, Ludi Wang, Yi Du · Apr 7, 2026 · Citations: 0
Rubric Rating Automatic Metrics
To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching.
- Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki · Apr 1, 2026 · Citations: 0
Rubric Rating Automatic Metrics
We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal…
- Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
Xinran Zhang · Mar 30, 2026 · Citations: 0
Rubric Rating Automatic Metrics
Atomic decomposition -- breaking a candidate answer into claims before verifying each against a reference -- is a widely adopted design for LLM-based reference-grounded judges.
- Do Phone-Use Agents Respect Your Privacy?
Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang · Apr 1, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We study whether phone-use agents respect privacy while completing benign mobile tasks.
- LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
- Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA
Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo · Apr 24, 2026 · Citations: 0
Automatic Metrics Multi Agent
We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis.
- FlowForge: A Staged Local Rollout Engine for Flow-Field Prediction
Xiaowen Zhang, Ziming Zhou, Fengnian Zhao, David L. S. Hung · Apr 21, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce FlowForge, a staged local rollout engine that predicts future flow fields by compiling a locality-preserving update schedule and executing it with a shared lightweight local predictor.
- Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents
Khushal Sethi · Apr 9, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement.
- MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong · Apr 6, 2026 · Citations: 0
Automatic Metrics Long Horizon
Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over…
- Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
Xuan Qi · Apr 2, 2026 · Citations: 0
Automatic Metrics Tool Use
Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood.
- OSCAR: Orchestrated Self-verification and Cross-path Refinement
Yash Shah, Abhijit Chakraborty, Naresh Kumar Devulapally, Vishnu Lokhande, Vivek Gupta · Apr 2, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods.
- S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young · Apr 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
- Asymmetric Actor-Critic for Multi-turn LLM Agents
Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia · Mar 31, 2026 · Citations: 0
Automatic Metrics Long Horizon
In many real-world applications, agents must succeed in one-shot settings where retries are impossible.
- Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
- From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
Zhengxu Yu, Yu Fu, Zhiyuan He, Yuxuan Huang, Lee Ka Yiu · Apr 24, 2026 · Citations: 0
Automatic Metrics Multi Agent
Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning.
- Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri · Apr 23, 2026 · Citations: 0
Automatic Metrics Long Horizon
The transition from stateless language model inference to persistent, multi session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production grade agentic systems.
- PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai · Apr 9, 2026 · Citations: 0
Automatic Metrics Long Horizon
Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints.
- Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
Guan-Ting Lin, Chen Chen, Zhehuai Chen, Hung-yi Lee · Apr 6, 2026 · Citations: 0
Automatic Metrics Tool Use
We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use.