- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Expert Verification Automatic Metrics
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
Mitul Goswami, Romit Chatterjee, Arif Ahmed Sekh · Mar 16, 2026 · Citations: 0
Expert Verification Automatic Metrics
Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…
- Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
Xinghao Zhao · Mar 19, 2026 · Citations: 0
Automatic Metrics
Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive.
- RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models
Rahul Soni · Mar 27, 2026 · Citations: 0
Critique Edit
Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks.
- Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0
Pairwise Preference Human Eval
We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight…
- Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0
Demonstrations Simulation Env
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
- Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment
Xinyu Zhang · Mar 23, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%).
- GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Zhichao Wang · Oct 27, 2025 · Citations: 0
Pairwise Preference Automatic Metrics
This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
- FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol
He Zhang, Anzhou Zhang, Jian Dai · Oct 2, 2025 · Citations: 0
Pairwise PreferenceCritique Edit Automatic Metrics
Beyond structured math tasks, FOR-Prompting supports refinement in open-ended and multi-stage tasks: qualitative analysis shows improved exploration, coverage, and specificity, and a blind study of human preferences found that participants…
- Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0
Red Team Automatic Metrics
These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
- LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
- Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents
Khushal Sethi · Apr 9, 2026 · Citations: 0
Automatic Metrics
We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement.
- S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young · Apr 1, 2026 · Citations: 0
Automatic Metrics
Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
- Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Deepon Halder, Raj Dabre · Mar 15, 2026 · Citations: 0
Automatic Metrics
Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating…
- D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0
Automatic Metrics
In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
- The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao · Jan 21, 2026 · Citations: 0
Automatic Metrics
We demonstrate that effective reasoning can be better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead.
- Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying · Dec 3, 2025 · Citations: 0
Automatic Metrics
Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBenchV2 and SCBench), TRIM-KV consistently outperforms…
- SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych · Jun 18, 2025 · Citations: 0
Automatic Metrics
To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determine…
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · Jun 17, 2025 · Citations: 0
Automatic Metrics
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
- DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li · Oct 9, 2025 · Citations: 0
Llm As JudgeAutomatic Metrics
Our method features a specialized judge model trained with out-of-distribution data (AIME 2022, AIME 2023, and MATH 500) using oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072…
- SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar · Feb 3, 2026 · Citations: 0
Automatic Metrics
To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang · Mar 16, 2025 · Citations: 0
Automatic Metrics
Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM.
- ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays
Aishik Sanyal · Feb 26, 2026 · Citations: 0
Pairwise Preference
Inspired by Humphrey's ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience Ns and an optional affect proxy reporting…
- Schema for In-Context Learning
Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung · Oct 14, 2025 · Citations: 0
Demonstrations
Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce Schema-Activated In-Context…
- Context Over Content: Exposing Evaluation Faking in Automated Judges
Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar · Apr 16, 2026 · Citations: 0
- An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2
Ryan Lail · Apr 15, 2026 · Citations: 0
- IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
Aviral Dawar, Roshan Karanth, Vikram Goyal, Dhruv Kumar · Apr 15, 2026 · Citations: 0
- One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram · Apr 14, 2026 · Citations: 0
- Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities
Zhichen Liu, Yongyuan Li, Yang Xu · Apr 11, 2026 · Citations: 0
- Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions
Utshab Kumar Ghosh, Ashish David, Shubham Chatterjee · Apr 11, 2026 · Citations: 0
- DMax: Aggressive Parallel Decoding for dLLMs
Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang · Apr 9, 2026 · Citations: 0
- TEMPER: Testing Emotional Perturbation in Quantitative Reasoning
Atahan Dokme, Benjamin Reichman, Larry Heck · Apr 9, 2026 · Citations: 0
- Sensitivity-Positional Co-Localization in GQA Transformers
Manoj Chandrashekar Rao · Apr 9, 2026 · Citations: 0
- Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang · Apr 9, 2026 · Citations: 0
- IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
David Gringras · Apr 9, 2026 · Citations: 0
- Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
Sarath Shekkizhar, Romain Cosentino, Adam Earle · Apr 2, 2026 · Citations: 0
- Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao · Apr 2, 2026 · Citations: 0
- Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?
Liang Zhang, Yu Fu, Xinyi Jin · Mar 26, 2026 · Citations: 0
- Cross-Model Disagreement as a Label-Free Correctness Signal
Matt Gorbett, Suman Jana · Mar 26, 2026 · Citations: 0
- Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models
Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li · Mar 26, 2026 · Citations: 0
- Mechanistically Interpreting Compression in Vision-Language Models
Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das · Mar 26, 2026 · Citations: 0
- Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients
Michael Hardy, Joshua Gilbert, Benjamin Domingue · Mar 26, 2026 · Citations: 0
- Off-Policy Value-Based Reinforcement Learning for Large Language Models
Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu · Mar 24, 2026 · Citations: 0
- RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue
Long Mai · Mar 24, 2026 · Citations: 0
- Edge Radar Material Classification Under Geometry Shifts
Jannik Hohmann, Dong Wang, Andreas Nüchter · Mar 24, 2026 · Citations: 0
- Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
Richard J. Young · Mar 23, 2026 · Citations: 0
- Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna · Mar 18, 2026 · Citations: 0
- Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies
Giuseppe Samo, Paola Merlo · Mar 16, 2026 · Citations: 0
- Attention Residuals
Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu · Mar 16, 2026 · Citations: 0
- MXNorm: Reusing MXFP block scales for efficient tensor normalisation
Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi · Mar 13, 2026 · Citations: 0
- TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva · Mar 13, 2026 · Citations: 0
- NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation
Yuxin Yang, Haoran Zhang, Mingxuan Li, Jiachen Xu, Ruoxi Shen · Mar 12, 2026 · Citations: 0
- LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories
Qianpu Sun, Xiaowei Chi, Yuhan Rui, Ying Li, Kuangzhi Ge · Mar 12, 2026 · Citations: 0
- UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou · Mar 9, 2026 · Citations: 0
- Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests
Amutheezan Sivagnanam, Ayan Mukhopadhyay, Samitha Samaranayake, Abhishek Dubey, Aron Laszka · Mar 8, 2026 · Citations: 0
- In-Context Environments Induce Evaluation-Awareness in Language Models
Maheep Chaudhary · Mar 4, 2026 · Citations: 0
- Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning
Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu · Mar 4, 2026 · Citations: 0
- CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li · Mar 1, 2026 · Citations: 0
- Polynomial Mixing for Efficient Self-supervised Speech Encoders
Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen · Feb 28, 2026 · Citations: 0
- Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
Jie Cao, Tianwei Lin, Zhenxuan Fan, Bo Yuan, Ziyuan Zhao · Feb 28, 2026 · Citations: 0