- Speculative Decoding: Performance or Illusion?
Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung · Dec 31, 2025 · Citations: 0
Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch…
- The Agentic Leash: Extracting Causal Feedback Fuzzy Cognitive Maps with LLMs
Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko · Dec 31, 2025 · Citations: 0
We design a large-language-model (LLM) agent system that extracts causal feedback fuzzy cognitive maps (FCMs) from raw text.
- RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang · Dec 31, 2025 · Citations: 0
While large language models (LLMs) have shown significant results on relevance task, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics…
- ADOPT: Adaptive Dependency-Guided Joint Prompt Optimization for Multi-Step LLM Pipelines
Minjun Zhao, Xinyu Zhang, Shuai Zhang, Deyang Li, Ruifeng Shi · Dec 31, 2025 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao · Dec 31, 2025 · Citations: 0
Long Horizon
We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agentic model.
- Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech
Fabian Retkowski, Alexander Waibel · Dec 30, 2025 · Citations: 0
First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task.
- Multi-Agent LLMs for Generating Research Limitations
Ibrahim Al Azher, Zhishuai Guo, Hamed Alhoori · Dec 30, 2025 · Citations: 0
Multi Agent
We propose, a multi-agent LLM framework for generating substantive limitations.
- Activation Steering for Masked Diffusion Language Models
Adi Shnaidman, Erin Feiglin, Osher Yaari, Efrat Mentel, Amit Levi · Dec 30, 2025 · Citations: 0
Using safety refusal as a deployment-relevant case study, we find that refusal behavior in multiple MDLMs is governed by a consistent, approximately one-dimensional activation subspace.
- WISE: Web Information Satire and Fakeness Evaluation
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury · Dec 30, 2025 · Citations: 0
This study develops WISE (Web Information Satire and Fakeness Evaluation) framework which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as…
- VL-RouterBench: A Benchmark for Vision-Language Model Routing
Zhehao Huang, Baijiong Lin, Jingyuan Zhang, Jingying Wang, Yuhang Liu · Dec 29, 2025 · Citations: 0
The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets.
- Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao · Dec 29, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Diversity or Precision? A Deep Dive into Next Token Prediction
Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang · Dec 28, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages
Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich · Dec 27, 2025 · Citations: 0
We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages.
- Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions
Katherine Elkins, Jon Chun · Dec 27, 2025 · Citations: 0
Negation-bearing syntax is the dominant failure mode, with some models endorsing actions at 80-97% rates even when asked whether agents not act.
- Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
- Geometric Scaling of Bayesian Inference in LLMs
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
- The Bayesian Geometry of Transformer Attention
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
- Hallucination Detection and Evaluation of Large Language Model
Chenggong Zhang, Haopeng Wang, Hexi Meng · Dec 27, 2025 · Citations: 0
To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high…
- Intrinsic-Metric Physics-Informed Neural Networks (IM-PINN) for Reaction-Diffusion Dynamics on Complex Riemannian Manifolds
Julian Evan Chrisnanto, Salsabila Rahma Alia, Nurfauzi Fadillah, Yulison Herry Chrisnanto · Dec 26, 2025 · Citations: 0
Benchmarking against the Surface Finite Element Method (SFEM) reveals superior physical rigor: the IM-PINN achieves global mass conservation error of E_{mass} \approx 0.157 versus SFEM's 0.258, acting as a thermodynamically consistent…
- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Expert Verification
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation
Abdullah Alabdullah, Lifeng Han, Chenghua Lin · Dec 25, 2025 · Citations: 0
Existing automatic evaluation metrics and general-purpose human evaluation frameworks struggle to capture dialect-specific MT errors, hindering progress in translation assessment.
- Measuring all the noises of LLM Evals
Sida Wang · Dec 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Parallel Token Prediction for Language Models
Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh · Dec 24, 2025 · Citations: 0
- Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation
Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun · Dec 24, 2025 · Citations: 0
- Semantic Refinement with LLMs for Graph Representations
Safal Thapaliya, Zehong Wang, Jiazheng Li, Ziming Li, Yanfang Ye · Dec 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation
Tomoaki Yamaguchi, Yutong Zhou, Masahiro Ryo, Keisuke Katsura · Dec 24, 2025 · Citations: 0
- Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation
Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen · Dec 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles
Ramatu Oiza Abdulsalam, Segun Aroyehun · Dec 23, 2025 · Citations: 0
Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice.
- DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull · Dec 23, 2025 · Citations: 0
Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.
- Generalization of RLVR Using Causal Reasoning as a Testbed
Brian Lu, Hongyu Zhao, Shuo Sun, Hao Peng, Rui Ding · Dec 23, 2025 · Citations: 0
- AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent
Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng · Dec 23, 2025 · Citations: 0
In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems.
- Reason2Decide: Rationale-Driven Multi-Task Learning
H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel · Dec 23, 2025 · Citations: 0
Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge).
- Geometric Organization of Cognitive States in Transformer Embedding Spaces
Sophie Zhao · Dec 23, 2025 · Citations: 0
- Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?
Zhe Yin, Xiaodong Gu, Beijun Shen · Dec 23, 2025 · Citations: 0
- Machine Unlearning in the Era of Quantum Machine Learning: An Empirical Study
Carla Crivoi, Radu Tudor Ionescu · Dec 22, 2025 · Citations: 0
- CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation
Dazhen Deng, Sen Yang, Yuchen He, Yuan Tian, Yingcai Wu · Dec 22, 2025 · Citations: 0
To support this framework, we construct CycleChart-Bench, a lifecycle-aligned benchmark where every chart sample carries aligned annotations for generation, schema parsing, data parsing, and question answering.
- On the Existence and Behavior of Secondary Attention Sinks
Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu · Dec 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?
Amar Lakel · Dec 22, 2025 · Citations: 0
This paper proposes an epistemological shift in the analysis of large generative models, replacing the category ''Large Language Models'' (LLM) with that of ''Large Discourse Models'' (LDM), and then with that of Artificial Discursive Agent…
- Training-Free Global Geometric Association for 4D LiDAR Panoptic Segmentation
Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin · Dec 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- From Word to World: Can Large Language Models be Implicit Text-based World Models?
Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang · Dec 21, 2025 · Citations: 0
Long Horizon
Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale.
- Adaptive Accountability in Networked MAS: Tracing and Mitigating Emergent Norms at Scale
Saad Alqithami · Dec 21, 2025 · Citations: 0
- NASTaR: NovaSAR Automated Ship Target Recognition Dataset
Benyamin Hosseiny, Kamirul Kamirul, Odysseas Pappas, Alin Achim · Dec 20, 2025 · Citations: 0
- Exploration vs. Fixation: Scaffolding Divergent and Convergent Thinking for Human-AI Co-Creation with Generative Models
Chao Wen, Tung Phung, Pronita Mehrotra, Sumit Gulwani, Roger E. Beaty · Dec 20, 2025 · Citations: 0
We examine an approach grounded in the Geneplore model of creative cognition and instantiate it in a human-AI co-creation system, HAICo, for creative image generation.
- Towards Efficient Agents: A Co-Design of Inference Architecture and System
Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu · Dec 20, 2025 · Citations: 0
Long Horizon
The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
- DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation
Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park · Dec 19, 2025 · Citations: 0
Rubric RatingExpert Verification Long Horizon
However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and…
- Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science
Jan Philip Wahle, Krishnapriya Vishnubhotla, Bela Gipp, Saif M. Mohammad · Dec 19, 2025 · Citations: 0
ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.
- RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette · Dec 19, 2025 · Citations: 0
The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
- Value Under Ignorance in Universal Artificial Intelligence
Cole Wyeth, Marcus Hutter · Dec 18, 2025 · Citations: 0
We generalize the AIXI reinforcement learning agent to admit a wider class of utility functions.
- Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Khushboo Thaker, Yony Bresler · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SFBD-OMNI: Bridge models for lossy measurement restoration with limited clean samples
Haoye Lu, Yaoliang Yu, Darren Lo · Dec 18, 2025 · Citations: 0
- Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille · Dec 18, 2025 · Citations: 0
Pairwise Preference
Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training.
- In-Context Algebra
Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus · Dec 18, 2025 · Citations: 0
Red Team
We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal--compliance direction.
- TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li · Dec 18, 2025 · Citations: 0
Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios.
- Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano · Dec 18, 2025 · Citations: 0
We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual…
- Pretrained battery transformer (PBT): A foundation model for universal battery life prediction
Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang · Dec 18, 2025 · Citations: 0
- Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu · Dec 18, 2025 · Citations: 0
Evaluation of six state-of-the-art LLMs reveals pervasive risk: the average Overall Leakage Rate reaches 62.11% with an H-Score of only 52.90%.
- Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He · Dec 18, 2025 · Citations: 0
Pairwise Preference Tool Use
Large language model (LLM) agents are moving beyond prompting alone.
- A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
Mengfan Shen, Kangqi Song, Xindi Wang, Wei Jia, Tao Wang · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning
Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan Lu · Dec 17, 2025 · Citations: 0