- RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile · Apr 2, 2026 · Citations: 0
Expert Verification Llm As Judge Automatic Metrics
This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic…
- Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
Xinghao Zhao · Mar 19, 2026 · Citations: 0
Automatic Metrics Long Horizon
Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive.
- Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment
Xinyu Zhang · Mar 23, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%).
- PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs
Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang · Mar 21, 2026 · Citations: 0
Critique Edit Automatic Metrics
In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark.
- QED-Nano: Teaching a Tiny Model to Prove Hard Theorems
LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching · Apr 6, 2026 · Citations: 0
Rubric Rating Automatic Metrics
To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
- Towards Reward Modeling for AI Tutors in Math Mistake Remediation
Kseniia Petukhova, Ekaterina Kochmar · Mar 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We develop and release Bradley-Terry preference models trained on weighted-sum rankings that we automatically create from MRBench, synthetic pairs, and data combinations.
- Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents
Khushal Sethi · Apr 9, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement.
- S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young · Apr 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
- Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Deepon Halder, Raj Dabre · Mar 15, 2026 · Citations: 0
Automatic Metrics Long Horizon
Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating…
- Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown · Mar 9, 2026 · Citations: 0
Automatic Metrics Long Horizon
Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead.
- Truth as a Compression Artifact in Language Model Training
Konstantin Krestnikov · Mar 12, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus.
- Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning
Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li · Apr 1, 2026 · Citations: 0
Automatic Metrics Multi Agent
In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem.
- Learning to Predict Future-Aligned Research Proposals with Language Models
Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu · Mar 28, 2026 · Citations: 0
Human Eval Automatic Metrics
Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality.
- SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning
Zhengyang Ai, Zikang Shan, Xiaodong Ai, Jingxian Tang, Hangkai Hu · Apr 8, 2026 · Citations: 0
Automatic Metrics Long Horizon
Extensive experiments in math reasoning across three base models and five benchmarks demonstrate that SHAPE achieves an average accuracy gain of 3% with 30% reduced token consumption.
- TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models
Lingjie Chen, Ruizhong Qiu, Yuyu Fan, Yanjun Zhao, Hanghang Tong · Apr 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
Experiments on LLaDA and Dream across math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and train-free acceleration baselines, while achieving competitive…
- How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Minzhu Tu, Shiyu Ni, Keping Bi · Apr 8, 2026 · Citations: 0
Human Eval Automatic Metrics
Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases.
- Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning
Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong · Apr 8, 2026 · Citations: 0
Automatic Metrics Long Horizon
Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer.
- Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency
Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi · Mar 31, 2026 · Citations: 0
Automatic Metrics Long Horizon
Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9%…
- Mi:dm K 2.5 Pro
KT Tech innovation Group · Mar 19, 2026 · Citations: 0
Automatic Metrics Long Horizon
The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows.
- Replaying pre-training data improves fine-tuning
Suhas Kotha, Percy Liang · Mar 5, 2026 · Citations: 0
Automatic Metrics Web Browsing
We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%.