- Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0
Human Eval Automatic Metrics General
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
- PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0
LLM As Judge Automatic Metrics Medicine
In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
- CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation
Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad · Mar 6, 2026 · Citations: 0
Automatic Metrics Medicine
We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety.
- Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu · Apr 2, 2026 · Citations: 0
Automatic Metrics General
However, current reranking models are typically optimized on static human-annotated relevance labels in isolation, decoupled from the downstream generation process.
- Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
Simona-Vasilica Oprea, Adela Bâra · Apr 1, 2026 · Citations: 0
Automatic Metrics General
Using the Anthropic HH-RLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task.
- MemRerank: Preference Memory for Personalized Product Reranking
Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong · Mar 31, 2026 · Citations: 0
Automatic Metrics General
LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch.
- Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models
Han Zhang, Jiamin Su, Li liu · Mar 16, 2026 · Citations: 0
Automatic Metrics General
Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous.
- PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim · Mar 11, 2026 · Citations: 0
Automatic Metrics Medicine
We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses.
- Query-focused and Memory-aware Reranker for Long Context Processing
Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin · Feb 12, 2026 · Citations: 0
Automatic Metrics General
It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage.
- OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang · Mar 25, 2026 · Citations: 0
Automatic Metrics General
However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement.
- LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services
Jinwen Chen, Shuai Gong, Shiwen Zhang, Zheng Zhang, Yachao Zhao · Mar 5, 2026 · Citations: 0
Automatic Metrics General
While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency.
- PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark
Ziyang Zeng, Dun Zhang, Yu Yan, Xu Sun, Cuiqiaoshu Pan · Jan 13, 2026 · Citations: 0
Automatic Metrics Medicine
To address these limitations, we introduce PosIR (Position-Aware Information Retrieval), the first standardized benchmark designed to systematically diagnose position bias in diverse retrieval scenarios.
- CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models
Yifan Le, Yunliang Li · Jan 8, 2026 · Citations: 0
Automatic Metrics Multilingual
Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance.
- When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang · Feb 27, 2026 · Citations: 0
LLM As Judge Automatic Metrics Medicine
Due to the resource-intensive nature of large-scale human validation, the model's performance was evaluated through a dual-track framework: Track A utilized traditional lexical similarity metrics (e.g., BLEU, ROUGE), while Track B employed…
- A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing
Naeimeh Nourmohammadi, Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman · Feb 15, 2026 · Citations: 0
Automatic Metrics Medicine
We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability.
- Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency
Payal Fofadiya, Sunil Tiwari · Apr 2, 2026 · Citations: 0
Automatic Metrics General
Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation.
- Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents
Naman Gupta, Vaibhav Singh, Arun Iyer, Kirankumar Shiragur, Pratham Grover · Mar 10, 2026 · Citations: 0
Automatic Metrics General
Sequential multi-agent reasoning frameworks such as Chain-of-Agents (CoA) handle long-context queries by decomposing inputs into chunks and processing them sequentially using LLM-based worker agents that read from and update a bounded…
- LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models
Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Tri Nguyen, Vasudev Lal · Mar 6, 2026 · Citations: 0
Automatic Metrics General
Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serious safety risks, particularly the potential for deception as models acquire increased agency and human oversight diminishes.