- InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue · Feb 16, 2026 · Citations: 0
Llm As Judge Web Browsing
The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation.
- Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric
Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao · Feb 15, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Llm As Judge
To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which…
- Small Reward Models via Backward Inference
Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi · Feb 14, 2026 · Citations: 0
Rubric Rating Llm As Judge
However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility.
- World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026 · Citations: 0
Llm As JudgeSimulation Env Multi Agent
To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback-driven action refinement.
- The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI
Dusan Bosnjakovic · Feb 19, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics Multi Agent
As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral…
- Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
- Overton Pluralistic Reinforcement Learning for Large Language Models
Yu Fu, Seongho Son, Ilija Bogunovic · Feb 24, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
Existing alignment paradigms remain limited in capturing the pluralistic nature of human values.
- Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning
Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy · Feb 24, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70\% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning.
- Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g.
- Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation
Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Tareque Mohmud Chowdhury · Feb 16, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
We are using a zero-shot evaluation methodology and using BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning.
- AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization
Fahmida Liza Piya, Rahmatollah Beheshti · Feb 23, 2026 · Citations: 0
Human EvalLlm As Judge
We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and targeted correction to reduce hallucinated content.
- Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Hanna Yukhymenko, Anton Alexandrov, Martin Vechev · Feb 25, 2026 · Citations: 0
Llm As Judge
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.
- MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents
Zhenyu Wang, Xiaofen Xing, Yirong Chen, Xiangmin Xu · Feb 24, 2026 · Citations: 0
Llm As Judge
Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions.
- Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems
Mukul Chhabra, Luigi Medrano, Arush Verma · Feb 23, 2026 · Citations: 0
Llm As Judge
Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error…
- *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu · Feb 17, 2026 · Citations: 0
Llm As Judge
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods.
- Algebraic Quantum Intelligence: A New Framework for Reproducible Machine Creativity
Kazuo Yano, Jonghyeok Lee, Tae Ishitomi, Hironobu Kawaguchi, Akira Koyama · Feb 15, 2026 · Citations: 0
Llm As Judge
We evaluate the resulting system on creative reasoning benchmarks spanning ten domains under an LLM-as-a-judge protocol.