- SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
- $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan · Mar 4, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
- Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
Xinghao Zhao · Mar 19, 2026 · Citations: 0
Automatic Metrics
Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive.
- RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models
Rahul Soni · Mar 27, 2026 · Citations: 0
Critique Edit
Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks.
- KLong: Training LLM Agent for Extremely Long-horizon Tasks
Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi · Feb 19, 2026 · Citations: 0
Rubric Rating
Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics.
- Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0
Pairwise Preference Human Eval
We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight…
- WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics
Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, Tao Xie · Jan 5, 2026 · Citations: 0
Pairwise Preference Llm As Judge
However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics without relying on ground-truth implementations or test cases, and interpretable…
- Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment
Xinyu Zhang · Mar 23, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%).
- FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol
He Zhang, Anzhou Zhang, Jian Dai · Oct 2, 2025 · Citations: 0
Pairwise PreferenceCritique Edit Automatic Metrics
Beyond structured math tasks, FOR-Prompting supports refinement in open-ended and multi-stage tasks: qualitative analysis shows improved exploration, coverage, and specificity, and a blind study of human preferences found that participants…
- Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Deepon Halder, Raj Dabre · Mar 15, 2026 · Citations: 0
Automatic Metrics
Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating…
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0
Automatic Metrics
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao · Jan 21, 2026 · Citations: 0
Automatic Metrics
We demonstrate that effective reasoning can be better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead.
- Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying · Dec 3, 2025 · Citations: 0
Automatic Metrics
Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBenchV2 and SCBench), TRIM-KV consistently outperforms…
- SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych · Jun 18, 2025 · Citations: 0
Automatic Metrics
To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determine…
- Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty
Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing · Oct 7, 2025 · Citations: 0
Pairwise Preference
Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs).
- Evaluation of Large Language Models via Coupled Token Generation
Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco · Feb 3, 2025 · Citations: 0
Pairwise Preference
In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning.
- Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
- Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang · Mar 12, 2026 · Citations: 0
Pairwise Preference
Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked.
- Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng · Sep 27, 2025 · Citations: 0
Pairwise Preference
To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training.
- A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler · Jul 11, 2025 · Citations: 0
Pairwise Preference
There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation.
- Search Arena: Analyzing Search-Augmented LLMs
Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan · Jun 5, 2025 · Citations: 0
Pairwise Preference
In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs.
- Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?
Liang Zhang, Yu Fu, Xinyi Jin · Mar 26, 2026 · Citations: 0
- Cross-Model Disagreement as a Label-Free Correctness Signal
Matt Gorbett, Suman Jana · Mar 26, 2026 · Citations: 0
- Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients
Michael Hardy, Joshua Gilbert, Benjamin Domingue · Mar 26, 2026 · Citations: 0
- AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation
Liang Ding · Mar 22, 2026 · Citations: 0
- FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair
Ruize Ma, Yilei Jiang, Shilin Zhang, Zheng Ma, Yi Feng · Mar 18, 2026 · Citations: 0
- Mediocrity is the key for LLM as a Judge Anchor Selection
Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend · Mar 17, 2026 · Citations: 0
- daVinci-Env: Open SWE Environment Synthesis at Scale
Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang · Mar 13, 2026 · Citations: 0
- When LLM Judge Scores Look Good but Best-of-N Decisions Fail
Eddie Landesberg · Mar 12, 2026 · Citations: 0
- NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation
Yuxin Yang, Haoran Zhang, Mingxuan Li, Jiachen Xu, Ruoxi Shen · Mar 12, 2026 · Citations: 0
- In-Context Environments Induce Evaluation-Awareness in Language Models
Maheep Chaudhary · Mar 4, 2026 · Citations: 0
- Qwen3-Coder-Next Technical Report
Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng · Feb 28, 2026 · Citations: 0
- SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev · Feb 27, 2026 · Citations: 0
- SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation
Jiahao Zhao, Feng Jiang, Shaowei Qin, Zhonghui Zhang, Junhao Liu · Feb 26, 2026 · Citations: 0
- Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
Shijing Hu, Jingyang Li, Zhihui Lu, Pan Zhou · Sep 26, 2025 · Citations: 0
- LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li · Jun 23, 2025 · Citations: 0
- Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune · May 29, 2025 · Citations: 0