- $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan · Mar 4, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, $V_1$-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
- KLong: Training LLM Agent for Extremely Long-horizon Tasks
Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie · Feb 19, 2026 · Citations: 0
Rubric Rating
Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics.
- S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young · Apr 1, 2026 · Citations: 0
Automatic Metrics
Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
- Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Deepon Halder, Raj Dabre · Mar 15, 2026 · Citations: 0
Automatic Metrics
Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating…
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0
Automatic Metrics
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0
Automatic Metrics
In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
- Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying · Dec 3, 2025 · Citations: 0
Automatic Metrics
Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBenchV2 and SCBench), TRIM-KV consistently outperforms…
- Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
- DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li · Oct 9, 2025 · Citations: 0
LLM As Judge Automatic Metrics
Our method features a specialized judge model trained with out-of-distribution data (AIME 2022, AIME 2023, and MATH 500) using oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072…
- Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
Tae-Eun Song · Mar 23, 2026 · Citations: 0
Automatic Metrics
LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly…
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang · Mar 16, 2025 · Citations: 0
Automatic Metrics
Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM.
- Schema for In-Context Learning
Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung · Oct 14, 2025 · Citations: 0
Demonstrations
Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce Schema-Activated In-Context…
- Peer-Predictive Self-Training for Language Model Reasoning
Shi Feng, Hanlin Zhang, Fan Nie, Sham Kakade, Yiling Chen · Apr 14, 2026 · Citations: 0
- Sensitivity-Positional Co-Localization in GQA Transformers
Manoj Chandrashekar Rao · Apr 9, 2026 · Citations: 0
- Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang · Apr 9, 2026 · Citations: 0
- SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling
Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang · Mar 24, 2026 · Citations: 0
- Off-Policy Value-Based Reinforcement Learning for Large Language Models
Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu · Mar 24, 2026 · Citations: 0
- Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
Richard J. Young · Mar 23, 2026 · Citations: 0
- AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation
Liang Ding · Mar 22, 2026 · Citations: 0
- FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair
Ruize Ma, Yilei Jiang, Shilin Zhang, Zheng Ma, Yi Feng · Mar 18, 2026 · Citations: 0
- daVinci-Env: Open SWE Environment Synthesis at Scale
Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang · Mar 13, 2026 · Citations: 0
- TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva · Mar 13, 2026 · Citations: 0
- Tool Verification for Test-Time Reinforcement Learning
Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh · Mar 2, 2026 · Citations: 0
- CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li · Mar 1, 2026 · Citations: 0
- Qwen3-Coder-Next Technical Report
Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng · Feb 28, 2026 · Citations: 0
- Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
Jie Cao, Tianwei Lin, Zhenxuan Fan, Bo Yuan, Ziyuan Zhao · Feb 28, 2026 · Citations: 0
- SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev · Feb 27, 2026 · Citations: 0
- Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents
Zhi Chen, Zhensu Sun, Yuling Shi, Chao Peng, Xiaodong Gu · Feb 8, 2026 · Citations: 0
- TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks
Vansh Kapoor, Aman Gupta, Hao Chen, Anurag Beniwal, Jing Huang · Jan 15, 2026 · Citations: 0
- PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models
Haoyu Zheng, Yun Zhu, Yuqian Yuan, Bo Yuan, Wenqiao Zhang · Jan 7, 2026 · Citations: 0
- SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang · Oct 10, 2025 · Citations: 0
- Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune · May 29, 2025 · Citations: 0
- Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios
Zhi Chen, Wei Ma, Lingxiao Jiang · Mar 16, 2025 · Citations: 0