- Diversity or Precision? A Deep Dive into Next Token Prediction
Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang · Dec 28, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages
Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich · Dec 27, 2025 · Citations: 0
We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages.
- Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions
Katherine Elkins, Jon Chun · Dec 27, 2025 · Citations: 0
Negation-bearing syntax is the dominant failure mode, with some models endorsing actions at 80-97% rates even when asked whether agents should not act.
- Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
- Geometric Scaling of Bayesian Inference in LLMs
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
- The Bayesian Geometry of Transformer Attention
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
- Hallucination Detection and Evaluation of Large Language Model
Chenggong Zhang, Haopeng Wang, Hexi Meng · Dec 27, 2025 · Citations: 0
To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high…
- Intrinsic-Metric Physics-Informed Neural Networks (IM-PINN) for Reaction-Diffusion Dynamics on Complex Riemannian Manifolds
Julian Evan Chrisnanto, Salsabila Rahma Alia, Nurfauzi Fadillah, Yulison Herry Chrisnanto · Dec 26, 2025 · Citations: 0
Benchmarking against the Surface Finite Element Method (SFEM) reveals superior physical rigor: the IM-PINN achieves global mass conservation error of E_mass ≈ 0.157 versus SFEM's 0.258, acting as a thermodynamically consistent…
- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Expert Verification
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation
Abdullah Alabdullah, Lifeng Han, Chenghua Lin · Dec 25, 2025 · Citations: 0
Existing automatic evaluation metrics and general-purpose human evaluation frameworks struggle to capture dialect-specific MT errors, hindering progress in translation assessment.
- Measuring all the noises of LLM Evals
Sida Wang · Dec 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Parallel Token Prediction for Language Models
Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh · Dec 24, 2025 · Citations: 0
- Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation
Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun · Dec 24, 2025 · Citations: 0
- Semantic Refinement with LLMs for Graph Representations
Safal Thapaliya, Zehong Wang, Jiazheng Li, Ziming Li, Yanfang Ye · Dec 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation
Tomoaki Yamaguchi, Yutong Zhou, Masahiro Ryo, Keisuke Katsura · Dec 24, 2025 · Citations: 0
- Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation
Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen · Dec 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles
Ramatu Oiza Abdulsalam, Segun Aroyehun · Dec 23, 2025 · Citations: 0
Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice.
- DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull · Dec 23, 2025 · Citations: 0
Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.
- Generalization of RLVR Using Causal Reasoning as a Testbed
Brian Lu, Hongyu Zhao, Shuo Sun, Hao Peng, Rui Ding · Dec 23, 2025 · Citations: 0
- AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent
Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng · Dec 23, 2025 · Citations: 0
In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems.
- Reason2Decide: Rationale-Driven Multi-Task Learning
H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel · Dec 23, 2025 · Citations: 0
Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge).
- Geometric Organization of Cognitive States in Transformer Embedding Spaces
Sophie Zhao · Dec 23, 2025 · Citations: 0
- Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?
Zhe Yin, Xiaodong Gu, Beijun Shen · Dec 23, 2025 · Citations: 0
- Machine Unlearning in the Era of Quantum Machine Learning: An Empirical Study
Carla Crivoi, Radu Tudor Ionescu · Dec 22, 2025 · Citations: 0
- CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation
Dazhen Deng, Sen Yang, Yuchen He, Yuan Tian, Yingcai Wu · Dec 22, 2025 · Citations: 0
To support this framework, we construct CycleChart-Bench, a lifecycle-aligned benchmark where every chart sample carries aligned annotations for generation, schema parsing, data parsing, and question answering.
- On the Existence and Behavior of Secondary Attention Sinks
Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu · Dec 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?
Amar Lakel · Dec 22, 2025 · Citations: 0
This paper proposes an epistemological shift in the analysis of large generative models, replacing the category "Large Language Models" (LLM) with that of "Large Discourse Models" (LDM), and then with that of Artificial Discursive Agent…
- Training-Free Global Geometric Association for 4D LiDAR Panoptic Segmentation
Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin · Dec 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- From Word to World: Can Large Language Models be Implicit Text-based World Models?
Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang · Dec 21, 2025 · Citations: 0
Long Horizon
Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale.
- Adaptive Accountability in Networked MAS: Tracing and Mitigating Emergent Norms at Scale
Saad Alqithami · Dec 21, 2025 · Citations: 0
- NASTaR: NovaSAR Automated Ship Target Recognition Dataset
Benyamin Hosseiny, Kamirul Kamirul, Odysseas Pappas, Alin Achim · Dec 20, 2025 · Citations: 0
- Exploration vs. Fixation: Scaffolding Divergent and Convergent Thinking for Human-AI Co-Creation with Generative Models
Chao Wen, Tung Phung, Pronita Mehrotra, Sumit Gulwani, Roger E. Beaty · Dec 20, 2025 · Citations: 0
We examine an approach grounded in the Geneplore model of creative cognition and instantiate it in a human-AI co-creation system, HAICo, for creative image generation.
- Towards Efficient Agents: A Co-Design of Inference Architecture and System
Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu · Dec 20, 2025 · Citations: 0
Long Horizon
The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
- DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation
Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park · Dec 19, 2025 · Citations: 0
Rubric Rating Expert Verification Long Horizon
However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and…
- Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science
Jan Philip Wahle, Krishnapriya Vishnubhotla, Bela Gipp, Saif M. Mohammad · Dec 19, 2025 · Citations: 0
ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.
- RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette · Dec 19, 2025 · Citations: 0
The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
- Value Under Ignorance in Universal Artificial Intelligence
Cole Wyeth, Marcus Hutter · Dec 18, 2025 · Citations: 0
We generalize the AIXI reinforcement learning agent to admit a wider class of utility functions.
- Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Khushboo Thaker, Yony Bresler · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SFBD-OMNI: Bridge models for lossy measurement restoration with limited clean samples
Haoye Lu, Yaoliang Yu, Darren Lo · Dec 18, 2025 · Citations: 0
- Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille · Dec 18, 2025 · Citations: 0
Pairwise Preference
Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training.
- In-Context Algebra
Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus · Dec 18, 2025 · Citations: 0
Red Team
We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal-compliance direction.
- TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li · Dec 18, 2025 · Citations: 0
Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios.
- Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano · Dec 18, 2025 · Citations: 0
We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual…
- Pretrained battery transformer (PBT): A foundation model for universal battery life prediction
Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang · Dec 18, 2025 · Citations: 0
- Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu · Dec 18, 2025 · Citations: 0
Evaluation of six state-of-the-art LLMs reveals pervasive risk: the average Overall Leakage Rate reaches 62.11% with an H-Score of only 52.90%.
- Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He · Dec 18, 2025 · Citations: 0
Pairwise Preference Tool Use
Large language model (LLM) agents are moving beyond prompting alone.
- A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
Mengfan Shen, Kangqi Song, Xindi Wang, Wei Jia, Tao Wang · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning
Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan Lu · Dec 17, 2025 · Citations: 0
- Learning continuous state of charge dependent thermal decomposition kinetics for Li-ion cathodes using Kolmogorov-Arnold Chemical Reaction Neural Networks (KA-CRNNs)
Benjamin C. Koenig, Sili Deng · Dec 17, 2025 · Citations: 0
- Physics-driven human-like working memory outperforms digital networks in dynamic vision
Jingli Liu, Huannan Zheng, Bohao Zou, Kezhou Yang · Dec 17, 2025 · Citations: 0
- Enhancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point Cloud Projections
Adrian Straker, Paul Magdon, Marco Zullich, Maximilian Freudenberg, Christoph Kleinn · Dec 17, 2025 · Citations: 0
- The Moralization Corpus: Frame-Based Annotation and Analysis of Moralizing Speech Acts across Diverse Text Genres
Maria Becker, Mirko Sommer, Lars Tapken, Yi Wan Teh, Bruno Brocai · Dec 17, 2025 · Citations: 0
Moralizations are pragmatically complex and often implicit, posing significant challenges for both human annotators and NLP systems.
- MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang · Dec 17, 2025 · Citations: 0
- Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent
Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh · Dec 17, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li · Dec 16, 2025 · Citations: 0
We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria.
- A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images
Rao Muhammad Umer, Daniel Sens, Jonathan Noll, Sohom Dey, Christian Matek · Dec 16, 2025 · Citations: 0
Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides directly, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking.
- Dual-objective Language Models: Training Efficiency Without Overfitting
David Samuel, Lucas Georges Gabriel Charpentier · Dec 16, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CSyMR: Benchmarking Compositional Music Information Retrieval in Symbolic Music Reasoning
Boyang Wang, Yash Vishe, Xin Xu, Zachary Novack, Xunyi Jiang · Dec 16, 2025 · Citations: 0
- GRAFT: Grid-Aware Load Forecasting with Multi-Source Textual Alignment and Fusion
Fangzhou Lin, Guoshun He, Zhenyu Guo, Zhe Huang, Jinsong Tao · Dec 16, 2025 · Citations: 0