- RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang · Dec 31, 2025
While large language models (LLMs) have shown significant results on relevance tasks, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics…
- WISE: Web Information Satire and Fakeness Evaluation
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury · Dec 30, 2025
This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either…
- Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao · Dec 29, 2025
Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance.
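The abstract above describes coupling the router's decisions to the experts' capabilities via an auxiliary loss. One plausible form of such a constraint (an illustrative sketch, not the paper's actual loss) is a cross-entropy term that pushes the router's distribution toward experts that currently achieve low per-token loss:

```python
import numpy as np

def router_expert_alignment_loss(router_probs, expert_losses):
    """Hypothetical auxiliary loss: encourage the router to place
    probability mass on experts that currently achieve low loss.

    router_probs:  (tokens, experts) softmax outputs of the router
    expert_losses: (tokens, experts) per-token loss of each expert
    """
    # Turn expert losses into a target distribution (lower loss -> higher target).
    targets = np.exp(-expert_losses)
    targets /= targets.sum(axis=1, keepdims=True)
    # Cross-entropy between the target distribution and the router distribution.
    return float(-(targets * np.log(router_probs + 1e-9)).sum(axis=1).mean())
```

A router that concentrates on the strongest expert incurs a lower value than one that routes against expert capability, which is the alignment pressure the abstract alludes to.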
- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025
Expert Verification
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation
Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen · Dec 24, 2025
Reasoning distillation has attracted increasing attention.
- DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull · Dec 23, 2025
Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.
- On the Existence and Behavior of Secondary Attention Sinks
Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu · Dec 22, 2025
Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance.
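Sink positions as defined above can be spotted directly from an attention matrix: average the attention each key position receives over all queries and flag outliers. A minimal sketch (the 0.3 threshold is an illustrative choice, not a value from the paper):

```python
import numpy as np

def find_attention_sinks(attn, threshold=0.3):
    """Flag key positions that absorb a disproportionate share of attention.

    attn: (queries, keys) row-stochastic attention matrix from one head.
    Returns indices whose mean received attention exceeds `threshold`.
    """
    received = attn.mean(axis=0)          # average attention each key receives
    return np.where(received > threshold)[0].tolist()
```

On a typical head, position 0 (the BOS token) dominates `received`, matching the primary-sink behavior the abstract describes; the paper's secondary sinks would show up as additional flagged positions.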
- Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?
Amar Lakel · Dec 22, 2025
This paper proposes an epistemological shift in the analysis of large generative models, replacing the category "Large Language Models" (LLM) with that of "Large Discourse Models" (LDM), and then with that of Artificial Discursive Agent (ADA)…
- Towards Efficient Agents: A Co-Design of Inference Architecture and System
Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu · Dec 20, 2025
Long Horizon
The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
- Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Khushboo Thaker, Yony Bresler · Dec 18, 2025
Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance.
- In-Context Algebra
Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau · Dec 18, 2025
We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions in-context.
- Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus · Dec 18, 2025
Red Team
We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores, and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal–compliance direction.
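One way a ridge-regularized steering direction could be computed (an assumed formulation, not necessarily the paper's exact method) is to regress the judge's refusal-confidence scores onto residual-stream activations; the closed-form ridge solution then serves as the steering vector:

```python
import numpy as np

def ridge_steering_vector(acts, scores, lam=1.0):
    """Sketch of a ridge-regularized steering direction.

    acts:   (n_samples, d_model) residual-stream activations
    scores: (n_samples,) refusal confidence in [0, 1]
    lam:    ridge strength; larger values shrink noisy directions.
    """
    d = acts.shape[1]
    # Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y
    w = np.linalg.solve(acts.T @ acts + lam * np.eye(d), acts.T @ scores)
    return w / np.linalg.norm(w)          # unit-norm steering direction
```

The ridge penalty is what "better isolates" the direction: it damps components of the activations that do not consistently covary with the refusal scores.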
- A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
Mengfan Shen, Kangqi Song, Xindi Wang, Wei Jia, Tao Wang · Dec 18, 2025
Structured information extraction from police incident announcements is crucial for timely and accurate data processing, yet presents considerable challenges due to the variability and informal nature of textual sources such as social media…
- Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution
Jonathan Kamp, Roos Bakker, Dominique Blok · Dec 11, 2025
Pairwise Preference
In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics.
- Interpreto: An Explainability Library for Transformers
Antonin Poché, Thomas Mullor, Gabriele Sarti, Frédéric Boisnard, Corentin Friedrich · Dec 10, 2025
Interpreto is an open-source Python library for interpreting HuggingFace language models, from early BERT variants to LLMs.
- KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh · Dec 9, 2025
Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and management…
- QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Maximilian Kreutner, Jens Rupprecht, Georg Ahnert, Ahmed Salem, Markus Strohmaier · Dec 9, 2025
QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods.
- Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu · Dec 9, 2025
Long Horizon
Extensive experiments on the AerialVLN and OpenFly benchmarks validate the effectiveness of our method.
- Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan · Dec 8, 2025
We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions.
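The best-known instance of a group-action positional encoding is the rotary scheme, where position m acts on each 2-D subspace of the feature vector by a rotation of angle m·θᵢ. The sketch below shows that concrete case only; GRAPE's framework is stated to be more general, and none of these parameter choices come from the paper:

```python
import numpy as np

def rotate_pairs(x, pos, base=10000.0):
    """Rotary-style positional encoding as a group action: position `pos`
    rotates each (even, odd) pair of dimensions by pos * theta_i.

    x: (d,) feature vector with even d.
    """
    half = x.shape[0] // 2
    theta = base ** (-np.arange(half) / half)   # per-pair frequencies
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out
```

The group structure is what makes this a clean positional encoding: applying position m and then position n is identical to applying position m + n, so attention between rotated queries and keys depends only on relative offset.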
- STaRR: Spatial-Temporal Token-Dynamics-Aware Responsive Remasking for Diffusion Language Models
Xinhao Sun, Huaijin Zhao, Maoliang Li, Zihao Zheng, Jiayu Chen · Dec 7, 2025
Diffusion Language Models (DLMs) enable parallel decoding via iterative denoising, where remasking strategies play a critical role in balancing inference speed and output quality.
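A simple baseline for the remasking step mentioned above (a generic confidence heuristic for illustration, not STaRR's token-dynamics-aware strategy) keeps the most confident predictions from a denoising pass and remasks the rest for revision:

```python
import numpy as np

MASK = -1  # illustrative mask token id

def confidence_remask(token_ids, confidences, keep_ratio=0.5):
    """One remasking step in a diffusion-LM decode loop: keep the most
    confident fraction of predictions, remask the rest so the next
    denoising pass can revise them.
    """
    n_keep = max(1, int(len(token_ids) * keep_ratio))
    keep = np.argsort(confidences)[-n_keep:]  # indices of most confident tokens
    out = np.full_like(token_ids, MASK)
    out[keep] = token_ids[keep]
    return out
```

`keep_ratio` is exactly the speed–quality dial the abstract refers to: a higher ratio commits more tokens per pass (faster decoding), while a lower ratio leaves more tokens open to revision (higher quality).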
- Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025
Long Horizon
We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs. …)
- Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li · Dec 3, 2025
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths.
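The "prohibitive at long sequence lengths" claim is easy to make concrete with a back-of-envelope footprint formula (model shapes below are illustrative, not from the paper):

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, head_dim,
                   batch_size=1, dtype_bytes=2):
    """Back-of-envelope KV cache footprint for a transformer decoder.
    The factor 2 covers the separate key and value tensors;
    dtype_bytes=2 assumes fp16/bf16 storage.
    """
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * batch_size * dtype_bytes

# A hypothetical 32-layer model with 8 KV heads of dim 128 at a
# 128k-token context already needs 16 GiB of cache per sequence:
print(kv_cache_bytes(32, 131072, 8, 128) / 2**30)  # -> 16.0
```

Since the total scales linearly with both `seq_len` and `n_layers`, sharing or fusing caches across layers, as the abstract proposes, attacks one of the two multiplicative factors directly.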
- AITutor-EvalKit: Exploring the Capabilities of AI Tutors
Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025
Demonstrations
We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, providing software for demonstration and evaluation as well as model inspection and data visualization.
- Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs
Kunj Joshi, David A. Smith · Dec 2, 2025
We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy–utility tradeoffs, and compare randomized masked finetuning (RMFT) against deduplication using the Area Under the Response Curve (AURC) metric.
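The paper's AURC may be defined differently, but the standard way to collapse a tradeoff curve into a single scalar, such as utility retained as a privacy intervention is strengthened, is trapezoidal area under the curve:

```python
def area_under_curve(xs, ys):
    """Generic trapezoidal area under a tradeoff curve, shown only as
    the standard construction for summarizing a curve as one scalar;
    it is not necessarily the paper's exact AURC definition.

    xs: curve x-coordinates in increasing order
    ys: corresponding y-values (e.g., utility retained)
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid over each segment
    return area
```

Comparing two mitigation methods then reduces to comparing one number each, which is what makes a Pareto-style framework like the one described above workable.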
- Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li · Dec 2, 2025
Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision.
- From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?
Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall · Dec 2, 2025
To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison.
- promptolution: A Unified, Modular Framework for Prompt Optimization
Tom Zehle, Timo Heiß, Moritz Schlager, Matthias Aßenmacher, Matthias Feurer · Dec 2, 2025
It integrates multiple contemporary discrete prompt optimizers, supports systematic and reproducible benchmarking, and returns framework-agnostic prompt strings, enabling seamless integration into existing LLM pipelines while remaining agnostic…
- BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion
Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti · Dec 2, 2025
The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge.
- PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
Robert Belanec, Ivan Srba, Maria Bielikova · Dec 2, 2025
While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics.
- Cross-Lingual Interleaving for Speech Language Models
Adel Moumen, Guangzhi Sun, Philip C. Woodland · Dec 1, 2025
However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult.