- CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering
Yu Liu, Wenxiao Zhang, Diandian Guo, Cong Cao, Fangfang Yuan · Feb 1, 2026 · Citations: 0
Training combines two complementary forms of supervision: deterministic rewards enforce verifiable constraints, including format compliance, answer correctness, and citation-set validity, while a judge-based reward audits semantic…
- Evaluating Long-Horizon Memory for Multi-Party Collaborative Dialogues
Chuanrui Hu, Tong Li, Xingze Gao, Hongda Chen, Yi Bai · Feb 1, 2026 · Citations: 0
Long Horizon
In this paper, we introduce EverMemBench, the first benchmark designed for long-horizon collaborative memory, built from multi-party, multi-group conversations spanning over one million tokens with dense cross-topic interleaving, temporally…
- RE-MCDF: Closed-Loop Multi-Expert LLM Reasoning for Knowledge-Grounded Clinical Diagnosis
Shaowei Shen, Xiaohong Yang, Jie Yang, Lianfen Huang, Yongcai Zhang · Feb 1, 2026 · Citations: 0
Critique Edit Multi Agent
In such settings, single-agent systems are vulnerable to self-reinforcing errors, as their predictions lack independent validation and can drift toward spurious conclusions.
- Learnable Koopman-Enhanced Transformer-Based Time Series Forecasting with Spectral Control
Ali Forootani, Raffaele Iervolino · Feb 1, 2026 · Citations: 0
- EvoOpt-LLM: Evolving industrial optimization models with large language models
Yiliu He, Tianle Li, Binghao Ji, Zhiyuan Liu, Di Huang · Feb 1, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- What If We Allocate Test-Time Compute Adaptively?
Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan · Feb 1, 2026 · Citations: 0
Long Horizon
For each problem, the agent runs multiple inference iterations.
- Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
Xinrong Chen, Xu Chu, Yingmin Qiu, Hengyuan Zhang, Jing Xiong · Feb 1, 2026 · Citations: 0
- Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? A Study of Hierarchical Gating and Calibration
Víctor Yeste, Paolo Rosso · Jan 31, 2026 · Citations: 0
Human value detection from single sentences is a sparse, imbalanced multi-label task.
- Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing
Anxin Guo, Jingwei Li · Jan 31, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Minimum Variance Path Principle for Accurate and Stable Score-Based Density Ratio Estimation
Wei Chen, Jiacheng Li, Shigui Li, Zhiqi Lin, Junmei Yang · Jan 31, 2026 · Citations: 0
- Beyond Static Instruction: A Multi-agent AI Framework for Adaptive Augmented Reality Robot Training
Nicolas Leins, Jana Gonnermann-Müller, Malte Teichmann, Sebastian Pokutta · Jan 31, 2026 · Citations: 0
- Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation
Lakshan Cooray, Deshan Sumanathilaka, Pattigadapa Venkatesh Raju · Jan 31, 2026 · Citations: 0
Nine instruction-tuned, low-parameter-count SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods.
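The two metric families named in the snippet can be illustrated with a minimal sketch: token-level Jaccard overlap for the lexical side, and bag-of-words cosine standing in for an embedding-based semantic score. Both functions are illustrative stand-ins, not the paper's actual metrics.

```python
from collections import Counter
import math

def jaccard(ref: str, hyp: str) -> float:
    """Lexical overlap: |intersection| / |union| of the token sets."""
    a, b = set(ref.lower().split()), set(hyp.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def cosine_bow(ref: str, hyp: str) -> float:
    """Semantic-style similarity via bag-of-words cosine (a real setup
    would replace these sparse count vectors with sentence embeddings)."""
    va, vb = Counter(ref.lower().split()), Counter(hyp.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

ref = "please reset your password using the emailed link"
hyp = "reset the password with the link we emailed"
print(round(jaccard(ref, hyp), 3), round(cosine_bow(ref, hyp), 3))
```

A hypothetical customer-service reference/hypothesis pair like the one above scores moderately on both axes; semantically equivalent paraphrases with little word overlap are exactly where the two families diverge.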
- From Associations to Activations: Comparing Behavioral and Hidden-State Semantic Geometry in LLMs
Louis Schiekiera, Max Zimmer, Christophe Roux, Sebastian Pokutta, Fritz Günther · Jan 31, 2026 · Citations: 0
Using representational similarity analysis, we compare behavioral geometries to layerwise hidden-state similarity and benchmark against FastText, BERT, and cross-model consensus.
- Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
Xiang Zheng, Weiqi Zhai, Wei Wang, Boyu Yang, Wenbo Li · Jan 31, 2026 · Citations: 0
Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about whether such benchmarks can diagnose genuine reasoning competence.
- Intention-Adaptive LLM Fine-Tuning for Text Revision Generation
Zhexiong Liu, Diane Litman · Jan 31, 2026 · Citations: 0
Critique Edit
To address these challenges, we propose Intention-Tuning, an intention-adaptive layer-wise LLM fine-tuning framework that dynamically selects a subset of LLM layers to learn the intentions and subsequently transfers their representations to…
- Detecting AI-Generated Content in Academic Peer Reviews
Siyuan Shen, Kai Wang · Jan 30, 2026 · Citations: 0
Together, these findings provide suggestive evidence of a rapidly increasing presence of AI-assisted content in peer review and highlight the need for further study of its implications for scholarly evaluation.
- PaperBanana: Automating Academic Illustration for AI Scientists
Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li · Jan 30, 2026 · Citations: 0
Critique Edit
To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations.
- Mem-T: Densifying Rewards for Long-Horizon Memory Agents
Yanwei Yue, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li · Jan 30, 2026 · Citations: 0
- Should LLMs, like, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial
Jio Oh, Paul Vicinanza, Thomas Butler, Steven Euijong Whang, Dezhi Hong · Jan 30, 2026 · Citations: 0
Pairwise Preference
Independent evaluations confirm data quality, with annotators preferring MDial outputs over prior methods in 98% of pairwise comparisons for dialect naturalness.
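A headline win rate like the 98% above is more informative with an uncertainty estimate; a Wilson score interval is a standard choice. The comparison count below is invented for illustration (the abstract does not state it):

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a win rate of wins/n."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical: 98 wins out of 100 pairwise dialect-naturalness comparisons.
lo, hi = wilson_interval(98, 100)
print(f"win rate 98%, 95% CI ≈ [{lo:.3f}, {hi:.3f}]")
```

The Wilson interval behaves sensibly near the boundaries (it never crosses 0 or 1), which matters for extreme win rates like this one, where a naive normal approximation can.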
- TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization
Shichao Ma, Zhiyuan Ma, Ming Yang, Xiaofan Li, Xing Wu · Jan 30, 2026 · Citations: 0
- OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation
Jin Li, Tao Chen, Shuai Jiang, Weijie Wang, Jingwen Luo · Jan 30, 2026 · Citations: 0
We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to 1536 × 1536).
- KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026 · Citations: 0
Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation.
- Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models
Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Yiqiao Huang · Jan 30, 2026 · Citations: 0
- From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
Jiaxuan Gao, Jiaao Chen, Chuyi He, Shusheng Xu, Di Jin · Jan 30, 2026 · Citations: 0
Long Horizon
Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking and multi-step tool execution while following complex instructions.
- Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards
Yuanjie Lyu, Chengyu Wang, Lei Shen, Jun Huang, Tong Xu · Jan 30, 2026 · Citations: 0
Rubric Rating Tool Use
Small LLMs often struggle to match the agentic capabilities of large, costly models.
- AI and My Values: User Perceptions of LLMs' Ability to Extract, Embody, and Explain Human Values from Casual Conversations
Bhada Yun, Renn Su, April Yi Wang · Jan 30, 2026 · Citations: 0
Does AI understand human values?
- Sheaf Neural Networks and biomedical applications
Aneeqa Mehrab, Jan Willem Van Looy, Pietro Demurtas, Stefano Iotti, Emil Malucelli · Jan 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- RedSage: A Cybersecurity Generalist LLM
Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi · Jan 29, 2026 · Citations: 0
- From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes
Fariba Afrin Irany, Sampson Akwafuo · Jan 29, 2026 · Citations: 0
- Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text
Hongyi Zhou, Jin Zhu, Kai Ye, Ying Yang, Erhan Xu · Jan 29, 2026 · Citations: 0
Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, creating an urgent need for reliable algorithms to detect LLM-generated content.
- WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, Volker Tresp · Jan 29, 2026 · Citations: 0
- MoHETS: Long-term Time Series Forecasting with Mixture-of-Heterogeneous-Experts
Evandro S. Ortigossa, Guy Lutsker, Eran Segal · Jan 29, 2026 · Citations: 0
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026 · Citations: 0
Long Horizon
We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture.
- Indic-TunedLens: Interpreting Multilingual Models in Indian Languages
Mihir Panchal, Deeksha Varshney, Mamta, Asif Ekbal · Jan 29, 2026 · Citations: 0
We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low-resource languages.
- Temporal Sepsis Modeling: a Fully Interpretable Relational Way
Vincent Lemaire, Nédra Meloulli, Pierre Jaquet · Jan 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FBS: Modeling Native Parallel Reading inside a Transformer
Tongxi Wang · Jan 29, 2026 · Citations: 0
Existing acceleration methods largely patch this pipeline and miss core human-reading ingredients: content-adaptive foresight, chunk-structure-aware compute allocation, and train-test consistency for preview/skimming.
- MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning
Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen · Jan 29, 2026 · Citations: 0
- EnsembleLink: Accurate Record Linkage Without Training Data
Noah Dasanaike · Jan 29, 2026 · Citations: 0
Tool Use
On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling.
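Training-free record linkage of the kind described typically averages several string-similarity signals; a minimal unsupervised sketch (character-trigram Jaccard plus `difflib` sequence ratio — in the spirit of, but not identical to, EnsembleLink) might look like:

```python
import difflib

def char_ngrams(s: str, n: int = 3) -> set:
    """Padded lowercase character n-grams of a string."""
    s = f"  {s.lower()} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_jaccard(a: str, b: str) -> float:
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)

def seq_ratio(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def ensemble_link(query: str, candidates: list[str]) -> str:
    """Link the query to the candidate with the best average score
    across measures -- no labeled training pairs required."""
    score = lambda c: (ngram_jaccard(query, c) + seq_ratio(query, c)) / 2
    return max(candidates, key=score)

print(ensemble_link("Intl. Business Machines",
                    ["International Business Machines",
                     "Burlington Coat Factory"]))
```

Averaging complementary measures hedges against the failure modes of any single one (n-grams tolerate reordering; sequence ratio tolerates local edits), which is the intuition behind ensembling here.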
- INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection
Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya · Jan 28, 2026 · Citations: 0
Web Browsing
We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification.
- Learning Contextual Runtime Monitors for Safe AI-Based Autonomy
Alejandro Luque-Cerpa, Mengyuan Wang, Emil Carlsson, Sanjit A. Seshia, Devdatt Dubhashi · Jan 28, 2026 · Citations: 0
- CLEAR-Mamba: Towards Accurate, Adaptive and Trustworthy Multi-Sequence Ophthalmic Angiography Classification
Zhuonan Wang, Wenjie Yan, Wenqiao Zhang, Xiaohui Song, Jian Ma · Jan 28, 2026 · Citations: 0
- CCMamba: Topologically-Informed Selective State-Space Networks on Combinatorial Complexes for Higher-Order Graph Learning
Jiawen Chen, Qi Shao, Mingtong Zhou, Duxin Chen, Wenwu Yu · Jan 28, 2026 · Citations: 0
- MuVaC: A Variational Causal Framework for Multimodal Sarcasm Understanding in Dialogues
Diandian Guo, Fangfang Yuan, Cong Cao, Xixun Lin, Chuan Zhou · Jan 28, 2026 · Citations: 0
To bridge this gap, we propose MuVaC, a variational causal inference framework that mimics human cognitive mechanisms for understanding sarcasm, enabling robust multimodal feature learning to jointly optimize MSD and MuSE.
- Text-only adaptation in LLM-based ASR through text denoising
Andrés Carofilis, Sergio Burdisso, Esaú Villatoro-Tello, Shashi Kumar, Kadri Hacioglu · Jan 28, 2026 · Citations: 0
Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
- Self Voice Conversion as an Attack against Neural Audio Watermarking
Yigitcan Özer, Wanying Ge, Zhe Zhang, Xin Wang, Junichi Yamagishi · Jan 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery
Zhipeng Zhang, Xiongfei Su, Kai Li · Jan 28, 2026 · Citations: 0
In this work, we propose a meta-cognitive reinforcement learning framework that enables an agent to assess, regulate, and recover its learning behavior based on internally estimated reliability signals.
- Improving X-Codec-2.0 for Multi-Lingual Speech: 25 Hz Latent Rate and 24 kHz Sampling
Husein Zolkepli · Jan 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Rethinking Discrete Speech Representation Tokens for Accent Generation
Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell · Jan 27, 2026 · Citations: 0
We propose a unified evaluation framework that measures both accessibility of accent information via a novel Accent ABX task and recoverability via cross-accent Voice Conversion (VC) resynthesis.
- Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning
Magnus Boman · Jan 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- One Token Is Enough: Improving Diffusion Language Models with a Sink Token
Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Yao Hu · Jan 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Do LLMs Truly Benefit from Longer Context in Automatic Post-Editing?
Ahrii Kim, Seong-heum Kim · Jan 27, 2026 · Citations: 0
Our results show that proprietary LLMs achieve near human-level APE quality even with simple one-shot prompting, regardless of whether document context is provided.
- Formula-One Prompting: Equation-First Reasoning For Applied Mathematics
Natapong Nitarach, Pittawat Taveekitworachai, Kunat Pipatanakul · Jan 27, 2026 · Citations: 0
Results across five models and four benchmarks show F-1 outperforms CoT by +5.76% and PoT by +8.42% on average, winning 53 out of 60 benchmark-model comparisons (88.3%).
- CollectiveKV: Decoupling and Sharing Collaborative Information in Sequential Recommendation
Jingyu Li, Zhaocheng Du, Qianhui Zhu, kaiyuan Li, Zhicheng Zhang · Jan 27, 2026 · Citations: 0
- FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning
Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar · Jan 26, 2026 · Citations: 0
Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-OSS-20B), outperforming state-of-the-art methods such as TALE and ThinkLess.
- A Geometric Taxonomy of Hallucinations in LLMs
Javier Marín · Jan 26, 2026 · Citations: 0
DGI achieves AUROC=0.958 on human-crafted confabulations with 3.8% cross-domain degradation.
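An AUROC figure like the one reported can be computed from raw detector scores via the Mann-Whitney U statistic: the probability that a randomly chosen positive outscores a randomly chosen negative (ties count half). The score values below are invented for illustration, not the paper's data:

```python
def auroc(scores_pos, scores_neg):
    """AUROC as P(score_pos > score_neg), ties counted as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical DGI-style scores: confabulated vs. faithful outputs.
confabulated = [0.91, 0.84, 0.77, 0.95]
faithful     = [0.12, 0.33, 0.80, 0.05]
print(auroc(confabulated, faithful))
```

This O(n·m) pairwise form is fine for a sketch; production code would sort once and use ranks, but the value is identical.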
- LLMs versus the Halting Problem: Revisiting Program Termination Prediction
Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf · Jan 26, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- BabyReasoningBench: Generating Developmentally-Inspired Reasoning Tasks for Evaluating Baby Language Models
Kaustubh D. Dhole · Jan 26, 2026 · Citations: 0
Traditional evaluations of reasoning capabilities of language models are dominated by adult-centric benchmarks that presuppose broad world knowledge, complex instruction following, and mature pragmatic competence.
- Flatter Tokens are More Valuable for Speculative Draft Model Training
Jiaming Fan, Daming Cao, Xiangzhong Luo, Jiale Fu, Chonghan Liu · Jan 26, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Representational Homomorphism Predicts and Improves Compositional Generalization In Transformer Language Model
Zhiyu An, Wan Du · Jan 26, 2026 · Citations: 0
- Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang · Jan 26, 2026 · Citations: 0
We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 8-12x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.