- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Expert Verification Automatic Metrics
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
Mitul Goswami, Romit Chatterjee, Arif Ahmed Sekh · Mar 16, 2026 · Citations: 0
Expert Verification Automatic Metrics
Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…
- DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao · Oct 10, 2025 · Citations: 0
Demonstrations Simulation Env
Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped.
- Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0
Demonstrations Simulation Env
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
- Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0
Red Team Automatic Metrics
These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
- LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
- MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong · Apr 6, 2026 · Citations: 0
Automatic Metrics
Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over…
- OSCAR: Orchestrated Self-verification and Cross-path Refinement
Yash Shah, Abhijit Chakraborty, Naresh Kumar Devulapally, Vishnu Lokhande, Vivek Gupta · Apr 2, 2026 · Citations: 0
Automatic Metrics
We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods.
- Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Deepon Halder, Raj Dabre · Mar 15, 2026 · Citations: 0
Automatic Metrics
Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating…
- D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0
Automatic Metrics
In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · Jun 17, 2025 · Citations: 0
Automatic Metrics
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
- DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li · Oct 9, 2025 · Citations: 0
Llm As JudgeAutomatic Metrics
Our method features a specialized judge model trained with out-of-distribution data (AIME 2022, AIME 2023, and MATH 500) using oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072…
- SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar · Feb 3, 2026 · Citations: 0
Automatic Metrics
To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.
- RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025 · Citations: 0
Automatic Metrics
A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes…
- Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
Ziliang Wang, Kang An, Xuhui Zheng, Faqiang Qian, Weikun Zhang · Oct 1, 2025 · Citations: 0
Automatic Metrics
We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process.
- CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering
Yang Zhao, Chengxiao Dai, Wei Zhuo, Yue Xiu, Dusit Niyato · Sep 25, 2025 · Citations: 0
Automatic Metrics
We introduce CLAUSE, an agentic three-agent neuro-symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to…
- ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays
Aishik Sanyal · Feb 26, 2026 · Citations: 0
Pairwise Preference
Inspired by Humphrey's ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience Ns and an optional affect proxy reporting…
- Schema for In-Context Learning
Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung · Oct 14, 2025 · Citations: 0
Demonstrations
Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce Schema-Activated In-Context…
- Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang · Apr 23, 2026 · Citations: 0
- Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
Hao-Yuan Chen · Apr 23, 2026 · Citations: 0
- Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts
Azher Ahmed Efat, Seok Hwan Song, Wallapak Tavanapong · Apr 23, 2026 · Citations: 0
- TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping
Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher · Apr 22, 2026 · Citations: 0
- MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation
Markus Knauer, Edoardo Fiorini, Maximilian Mühlbauer, Stefan Schneyer, Promwat Angsuratanawech · Apr 22, 2026 · Citations: 0
- HaS: Accelerating RAG through Homology-Aware Speculative Retrieval
Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu · Apr 22, 2026 · Citations: 0
- Detoxification for LLM: From Dataset Itself
Wei Shao, Yihang Wang, Gaoyu Zhu, Ziqiang Cheng, Lei Yu · Apr 21, 2026 · Citations: 0
- Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
Tianyi Huang, Samuel Xu, Jason Tansong Dang, Samuel Yan, Kimberley Yin · Apr 19, 2026 · Citations: 0
- Cat-DPO: Category-Adaptive Safety Alignment
Tiankai Yang, Yi Nian, Xinyuan Li, Ruiyao Xu, Kaize Ding · Apr 19, 2026 · Citations: 0
- Neurosymbolic Repo-level Code Localization
Xiufeng Xu, Xiufeng Wu, Zejun Zhang, Yi Li · Apr 17, 2026 · Citations: 0
- Target-Oriented Pretraining Data Selection via Neuron-Activated Graph
Zijun Wang, Haoqin Tu, Weidong Zhou, Yiyang Zhou, Xiaohuan Zhou · Apr 17, 2026 · Citations: 0
- Context Over Content: Exposing Evaluation Faking in Automated Judges
Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar · Apr 16, 2026 · Citations: 0
- An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2
Ryan Lail · Apr 15, 2026 · Citations: 0
- IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
Aviral Dawar, Roshan Karanth, Vikram Goyal, Dhruv Kumar · Apr 15, 2026 · Citations: 0
- One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram · Apr 14, 2026 · Citations: 0
- A Triadic Suffix Tokenization Scheme for Numerical Reasoning
Olga Chetverina · Apr 13, 2026 · Citations: 0
- Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities
Zhichen Liu, Yongyuan Li, Yang Xu · Apr 11, 2026 · Citations: 0
- Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions
Utshab Kumar Ghosh, Ashish David, Shubham Chatterjee · Apr 11, 2026 · Citations: 0
- Sensitivity-Positional Co-Localization in GQA Transformers
Manoj Chandrashekar Rao · Apr 9, 2026 · Citations: 0
- Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang · Apr 9, 2026 · Citations: 0
- IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
David Gringras · Apr 9, 2026 · Citations: 0
- Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao · Apr 2, 2026 · Citations: 0
- Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models
Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li · Mar 26, 2026 · Citations: 0
- Mechanistically Interpreting Compression in Vision-Language Models
Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das · Mar 26, 2026 · Citations: 0
- Off-Policy Value-Based Reinforcement Learning for Large Language Models
Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu · Mar 24, 2026 · Citations: 0
- RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue
Long Mai · Mar 24, 2026 · Citations: 0
- Edge Radar Material Classification Under Geometry Shifts
Jannik Hohmann, Dong Wang, Andreas Nüchter · Mar 24, 2026 · Citations: 0
- How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge · Mar 23, 2026 · Citations: 0
- Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
Richard J. Young · Mar 23, 2026 · Citations: 0
- Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna · Mar 18, 2026 · Citations: 0
- IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time
Zhenghua Bao, Yi Shi · Mar 17, 2026 · Citations: 0
- Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies
Giuseppe Samo, Paola Merlo · Mar 16, 2026 · Citations: 0
- Attention Residuals
Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu · Mar 16, 2026 · Citations: 0
- MXNorm: Reusing MXFP block scales for efficient tensor normalisation
Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi · Mar 13, 2026 · Citations: 0
- TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva · Mar 13, 2026 · Citations: 0
- LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories
Qianpu Sun, Xiaowei Chi, Yuhan Rui, Ying Li, Kuangzhi Ge · Mar 12, 2026 · Citations: 0
- UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou · Mar 9, 2026 · Citations: 0
- Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests
Amutheezan Sivagnanam, Ayan Mukhopadhyay, Samitha Samaranayake, Abhishek Dubey, Aron Laszka · Mar 8, 2026 · Citations: 0
- Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning
Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu · Mar 4, 2026 · Citations: 0
- CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li · Mar 1, 2026 · Citations: 0
- Polynomial Mixing for Efficient Self-supervised Speech Encoders
Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen · Feb 28, 2026 · Citations: 0
- Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
Jie Cao, Tianwei Lin, Zhenxuan Fan, Bo Yuan, Ziyuan Zhao · Feb 28, 2026 · Citations: 0