- Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary · Oct 5, 2025 · Citations: 0
Rubric Rating Automatic Metrics Simulation Env
We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials (avg@N) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and…
- $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan · Mar 4, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, $V_1$-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
- Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0
Pairwise Preference Human Eval
We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight…
- SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling
Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang · Mar 24, 2026 · Citations: 0
- TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva · Mar 13, 2026 · Citations: 0
- PostTrainBench: Can LLM Agents Automate LLM Post-Training?
Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen · Mar 9, 2026 · Citations: 0
- Tool Verification for Test-Time Reinforcement Learning
Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh · Mar 2, 2026 · Citations: 0
- CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li · Mar 1, 2026 · Citations: 0
- Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents
Mahesh Ramesh, Kaousheik Jayakumar, Aswinkumar Ramkumar, Pavan Thodima, Aniket Rege · Jan 26, 2026 · Citations: 0
- Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm
Dadi Guo, Tianyi Zhou, Dongrui Liu, Chen Qian, Qihan Ren · Oct 1, 2025 · Citations: 0
- MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes
Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen · Sep 29, 2025 · Citations: 0