- Diversity or Precision? A Deep Dive into Next Token Prediction
Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang · Dec 28, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages
Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich · Dec 27, 2025 · Citations: 0
We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages.
- Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions
Katherine Elkins, Jon Chun · Dec 27, 2025 · Citations: 0
Negation-bearing syntax is the dominant failure mode, with some models endorsing actions at 80-97% rates even when asked whether agents should not act.
- Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
- Geometric Scaling of Bayesian Inference in LLMs
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
- The Bayesian Geometry of Transformer Attention
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
- Hallucination Detection and Evaluation of Large Language Model
Chenggong Zhang, Haopeng Wang, Hexi Meng · Dec 27, 2025 · Citations: 0
To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high…
- Intrinsic-Metric Physics-Informed Neural Networks (IM-PINN) for Reaction-Diffusion Dynamics on Complex Riemannian Manifolds
Julian Evan Chrisnanto, Salsabila Rahma Alia, Nurfauzi Fadillah, Yulison Herry Chrisnanto · Dec 26, 2025 · Citations: 0
Benchmarking against the Surface Finite Element Method (SFEM) reveals superior physical rigor: the IM-PINN achieves global mass conservation error of E_mass ≈ 0.157 versus SFEM's 0.258, acting as a thermodynamically consistent…
- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Expert Verification
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation
Abdullah Alabdullah, Lifeng Han, Chenghua Lin · Dec 25, 2025 · Citations: 0
Existing automatic evaluation metrics and general-purpose human evaluation frameworks struggle to capture dialect-specific MT errors, hindering progress in translation assessment.
- Measuring all the noises of LLM Evals
Sida Wang · Dec 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Parallel Token Prediction for Language Models
Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh · Dec 24, 2025 · Citations: 0
- Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation
Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun · Dec 24, 2025 · Citations: 0
- Semantic Refinement with LLMs for Graph Representations
Safal Thapaliya, Zehong Wang, Jiazheng Li, Ziming Li, Yanfang Ye · Dec 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation
Tomoaki Yamaguchi, Yutong Zhou, Masahiro Ryo, Keisuke Katsura · Dec 24, 2025 · Citations: 0
- Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation
Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen · Dec 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles
Ramatu Oiza Abdulsalam, Segun Aroyehun · Dec 23, 2025 · Citations: 0
Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice.
- DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull · Dec 23, 2025 · Citations: 0
Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.
- Generalization of RLVR Using Causal Reasoning as a Testbed
Brian Lu, Hongyu Zhao, Shuo Sun, Hao Peng, Rui Ding · Dec 23, 2025 · Citations: 0
- AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent
Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng · Dec 23, 2025 · Citations: 0
In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems.
- Coherence in the brain unfolds across separable temporal regimes
Davide Staub, Finn Rabe, Akhil Misra, Yves Pauli, Roya Hüppi · Dec 23, 2025 · Citations: 0
- Reason2Decide: Rationale-Driven Multi-Task Learning
H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel · Dec 23, 2025 · Citations: 0
Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge).
- Geometric Organization of Cognitive States in Transformer Embedding Spaces
Sophie Zhao · Dec 23, 2025 · Citations: 0
- Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?
Zhe Yin, Xiaodong Gu, Beijun Shen · Dec 23, 2025 · Citations: 0
- Machine Unlearning in the Era of Quantum Machine Learning: An Empirical Study
Carla Crivoi, Radu Tudor Ionescu · Dec 22, 2025 · Citations: 0
- CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation
Dazhen Deng, Sen Yang, Yuchen He, Yuan Tian, Yingcai Wu · Dec 22, 2025 · Citations: 0
To support this framework, we construct CycleChart-Bench, a lifecycle-aligned benchmark where every chart sample carries aligned annotations for generation, schema parsing, data parsing, and question answering.
- On the Existence and Behavior of Secondary Attention Sinks
Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu · Dec 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?
Amar Lakel · Dec 22, 2025 · Citations: 0
This paper proposes an epistemological shift in the analysis of large generative models, replacing the category "Large Language Models" (LLM) with that of "Large Discourse Models" (LDM), and then with that of Artificial Discursive Agent…
- Training-Free Global Geometric Association for 4D LiDAR Panoptic Segmentation
Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin · Dec 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.