- Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav · Jul 15, 2025 · Citations: 0
Pairwise Preference Automatic MetricsSimulation Env Long Horizon
We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents.
- RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile · Apr 2, 2026 · Citations: 0
Expert Verification Llm As JudgeAutomatic Metrics
This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic…
- $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan · Mar 4, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
- Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
- Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
Xinghao Zhao · Mar 19, 2026 · Citations: 0
Automatic Metrics Long Horizon
Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive.
- Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0
Automatic Metrics Multi Agent
Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures.
- Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment
Xinyu Zhang · Mar 23, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%).
- PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs
Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang · Mar 21, 2026 · Citations: 0
Critique Edit Automatic Metrics
In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark.
- GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Zhichao Wang · Oct 27, 2025 · Citations: 0
Pairwise Preference Automatic Metrics
This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
- FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol
He Zhang, Anzhou Zhang, Jian Dai · Oct 2, 2025 · Citations: 0
Pairwise PreferenceCritique Edit Automatic Metrics
Beyond structured math tasks, FOR-Prompting supports refinement in open-ended and multi-stage tasks: qualitative analysis shows improved exploration, coverage, and specificity, and a blind study of human preferences found that participants…
- LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
Ziming Zhu, Chenglong Wang, Haosong Xv, Shunjie Xing, Yifu Huo · Aug 26, 2025 · Citations: 0
Demonstrations Automatic Metrics Multi Agent
In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge.
- MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision
Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Ryan Chin · May 21, 2025 · Citations: 0
Critique Edit Automatic Metrics Multi Agent
Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks.
- QED-Nano: Teaching a Tiny Model to Prove Hard Theorems
LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching · Apr 6, 2026 · Citations: 0
Rubric Rating Automatic Metrics
To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
- Towards Reward Modeling for AI Tutors in Math Mistake Remediation
Kseniia Petukhova, Ekaterina Kochmar · Mar 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We develop and release Bradley-Terry preference models trained on weighted-sum rankings that we automatically create from MRBench, synthetic pairs, and data combinations.
- Surgical Post-Training: Cutting Errors, Keeping Knowledge
Wenye Lin, Kai Han · Mar 2, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover--and validate both theoretically and empirically--an overlooked yet critical mechanism: the implicit regularization inherent in Direct…
- Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms
Nurul Aisyah, Muhammad Dehan Al Kautsar, Arif Hidayat, Raqib Chowdhury, Fajri Koto · Jun 5, 2025 · Citations: 0
Rubric Rating Automatic Metrics
Assessment tasks include grading and generating personalized Indonesian feedback guided by rubric-based evaluation.
- Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents
Khushal Sethi · Apr 9, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement.
- S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young · Apr 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
- Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Deepon Halder, Raj Dabre · Mar 15, 2026 · Citations: 0
Automatic Metrics Long Horizon
Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating…
- Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown · Mar 9, 2026 · Citations: 0
Automatic Metrics Long Horizon
Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead.
- The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao · Jan 21, 2026 · Citations: 0
Automatic Metrics Long Horizon
We demonstrate that effective reasoning can be better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead.
- Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying · Dec 3, 2025 · Citations: 0
Automatic Metrics Long Horizon
Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBenchV2 and SCBench), TRIM-KV consistently outperforms…
- SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych · Jun 18, 2025 · Citations: 0
Automatic Metrics Long Horizon
To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determine…
- Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao · Apr 7, 2025 · Citations: 0
Red Team Automatic Metrics
We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies…
- Truth as a Compression Artifact in Language Model Training
Konstantin Krestnikov · Mar 12, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus.
- Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille · Dec 18, 2025 · Citations: 0
Pairwise Preference Automatic Metrics
Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training.
- Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning
Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li · Apr 1, 2026 · Citations: 0
Automatic Metrics Multi Agent
In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem.
- BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo · Feb 19, 2026 · Citations: 0
Automatic Metrics Long Horizon
However, such errors have rarely been captured by existing benchmarks.
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang · Mar 16, 2025 · Citations: 0
Automatic Metrics Long Horizon
Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM.
- Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
- Learning to Predict Future-Aligned Research Proposals with Language Models
Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu · Mar 28, 2026 · Citations: 0
Human EvalAutomatic Metrics
Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality.
- SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning
Zhengyang Ai, Zikang Shan, Xiaodong Ai, Jingxian Tang, Hangkai Hu · Apr 8, 2026 · Citations: 0
Automatic Metrics Long Horizon
Extensive experiments in math reasoning across three base models and five benchmarks demonstrate that SHAPE achieves an average accuracy gain of 3% with 30% reduced token consumption.
- TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models
Lingjie Chen, Ruizhong Qiu, Yuyu Fan, Yanjun Zhao, Hanghang Tong · Apr 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
Experiments on LLaDA and Dream across math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and train-free acceleration baselines, while achieving competitive…
- GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered
Jiale Lao, Immanuel Trummer · Mar 2, 2026 · Citations: 0
Automatic Metrics Multi Agent
As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources.
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance
Yanwei Ren, Haotian Zhang, Likang Xiao, Xikai Zhang, Jiaxing Huang · Feb 27, 2026 · Citations: 0
Automatic Metrics Long Horizon
To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained,…
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- GATES: Self-Distillation under Privileged Context with Consensus Gating
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Held-out in-domain accuracy under asymmetric evaluation improves from 46.0\% to 62.0\%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2\% to 35.4\%.
- Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
Ziyan Wang, Zheng Wang, Xingwei Qu, Qi Cheng, Jie Fu · Oct 5, 2025 · Citations: 0
Automatic Metrics Long Horizon
Specifically, it outperforms GRPO by up to 2.80 points in average on math reasoning benchmarks.
- AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0
Automatic Metrics Multi Agent
We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining.
- How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Minzhu Tu, Shiyu Ni, Keping Bi · Apr 8, 2026 · Citations: 0
Human EvalAutomatic Metrics
Large language models (LLMs) has been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases.
- Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0
Automatic Metrics Long Horizon
Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
- Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning
Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong · Apr 8, 2026 · Citations: 0
Automatic Metrics Long Horizon
Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer.
- Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency
Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi · Mar 31, 2026 · Citations: 0
Automatic Metrics Long Horizon
Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9%…
- Mi:dm K 2.5 Pro
KT Tech innovation Group · Mar 19, 2026 · Citations: 0
Automatic Metrics Long Horizon
The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows.
- Replaying pre-training data improves fine-tuning
Suhas Kotha, Percy Liang · Mar 5, 2026 · Citations: 0
Automatic Metrics Web Browsing
We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by 4.5\% and Basque question-answering accuracy by 2\%.
- ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce \ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
- Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026 · Citations: 0
Automatic Metrics Long Horizon
Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.