- AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations Human EvalLlm As Judge Long Horizon
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
- PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0
Expert Verification Llm As JudgeAutomatic Metrics
In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
- Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka · Apr 2, 2026 · Citations: 0
Pairwise Preference Llm As JudgeAutomatic Metrics
A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
- Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
Mingyang Song, Mao Zheng, Chenning Xu · Mar 11, 2026 · Citations: 0
Rubric RatingCritique Edit Llm As Judge
Through a large-scale study of 105,600 evaluation instances (32 LLMs \times 3 frontier judges \times 100 tasks \times 11 temperatures), we show that model-level agreement (Spearman ρ= 0.99) masks fragile sample-level agreement (Pearson r =…
- Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Llm As Judge
We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
- Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning
Zhaowei Zhang, Xiaohan Liu, Xuekai Zhu, Junchao Huang, Ceyao Zhang · Mar 11, 2026 · Citations: 0
Rubric Rating Llm As Judge
To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model.
- RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile · Apr 2, 2026 · Citations: 0
Expert Verification Llm As JudgeAutomatic Metrics
This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic…
- ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions
Monica Munnangi, Saiph Savage · Mar 11, 2026 · Citations: 0
Rubric Rating Llm As Judge
We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns.
- Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models
Hieu Xuan Le, Benjamin Goh, Quy Anh Tang · Mar 26, 2026 · Citations: 0
Red Team Llm As Judge
In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while…
- EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning
Andreas Sauter, Yuyue Zhao, Jacopo Urbani, Wenxiang Hu, Zaiqiao Meng · Mar 23, 2026 · Citations: 0
Rubric RatingCritique Edit Llm As Judge
EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, and (2) fine-grained language feedback that offers span-level critiques regarding grounding,…
- Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models
Yixuan Tang, Yi Yang · Mar 15, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics Long Horizon
Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish--dovish classification.
- LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
- Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models
Aryan Kasat, Smriti Singh, Aman Chadha, Vinija Jain · Mar 23, 2026 · Citations: 0
Llm As Judge Long Horizon
Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and…
- Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
Xin Sun, Di Wu, Sijing Qin, Isao Echizen, Abdallah El Ali · Apr 7, 2026 · Citations: 0
Pairwise Preference Llm As Judge
Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge).
- Text-to-Stage: Spatial Layouts from Long-form Narratives
Jefferson Hernandez, Swarnadeep Saha, Chenxi Whitehouse, Sanjeel Parekh, Calvin Murdock · Mar 18, 2026 · Citations: 0
Pairwise Preference Llm As Judge
In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications.
- Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge
Junjie Wu, Xuan Kan, Zihao He, Shunwen Tan, Bo Pan · Mar 12, 2026 · Citations: 0
Pairwise Preference Llm As Judge
Multimodal Large Language Models (MLLMs) have been widely adopted as MLLM-as-a-Judges due to their strong alignment with human judgment across various visual tasks.
- VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization
Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons · Mar 11, 2026 · Citations: 0
Pairwise Preference Llm As Judge
We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO).
- Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
Somaya Eltanbouly, Samer Rashwani · Mar 25, 2026 · Citations: 0
Human EvalLlm As Judge
Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments.
- Criterion-referenceability determines LLM-as-a-judge validity across physics assessment formats
Will Yeadon, Tom Hardy, Paul Mackay, Elise Agra · Mar 16, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations…
- Multi-Agent Dialectical Refinement for Enhanced Argument Classification
Jakub Bąba, Jarosław A. Chudziak · Mar 29, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics Multi Agent
We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty.
- Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
- Reward Prediction with Factorized World States
Yijun Shen, Delong Chen, Xianming Hu, Jiaming Mi, Hongbo Zhao · Mar 10, 2026 · Citations: 0
Llm As JudgeSimulation Env
We evaluate on RewardPrediction, a new benchmark dataset spanning five diverse domains and comprising 2,454 unique action-observation trajectories with step-wise ground-truth rewards.
- Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou · Apr 8, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations.
- CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation
Andrew Bouras, OMS-II Research Fellow · Mar 30, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves…
- To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs
Zohaib Khan, Mustafa Dogan, Ifeoma Okoh, Pouya Sadeghi, Siddhartha Shrestha · Apr 8, 2026 · Citations: 0
Llm As Judge
Using both human annotations and large-scale LLM-as-a-judge evaluations across hundreds of thousands of generations from state-of-the-art models, we show that misinformation generation varies systematically based on the country being…
- MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long · Apr 7, 2026 · Citations: 0
Llm As Judge
As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge.
- De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules
Keerat Guliani, Deepkamal Gill, David Landsman, Nima Eshraghi, Krishna Kumar · Apr 2, 2026 · Citations: 0
Llm As Judge
We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data.
- The Necessity of Setting Temperature in LLM-as-a-Judge
Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia · Mar 30, 2026 · Citations: 0
Llm As Judge
LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness.
- ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs
Inês Vieira, Inês Calvo, Iago Paulo, James Furtado, Rafael Ferreira · Mar 27, 2026 · Citations: 0
Llm As Judge
European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR).
- Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara · Mar 23, 2026 · Citations: 0
Llm As Judge
To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and web agent execution.
- TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression
Li Wang, Yandong Wang, Xin Yu, Kui Zhang, Tianhao Peng · Mar 23, 2026 · Citations: 0
Llm As Judge
Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise.
- GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation
Ja Young Lee, Mírian Silva, Mohamed Nasr, Shonda Witherspoon, Enzo Bozzani · Mar 18, 2026 · Citations: 0
Llm As Judge
Large language models (LLMs) are largely motivated by their performance on popular topics and benchmarks at the time of their release.
- Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck
Hongbin Zhang, Kehai Chen, Xuefen Bai, Youcheng Pan, Yang Xiang · Mar 11, 2026 · Citations: 0
Llm As Judge
Large language models (LLMs) have become a standard for multilingual evaluation, yet they exhibit a severe systematic translationese bias.