- Batch Speculative Decoding Done Right
Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li · Oct 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models
Li Zhou, Lutong Yu, You Lyu, Yihang Lin, Zefeng Zhao · Oct 26, 2025 · Citations: 0
Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation.
- Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study
Eeham Khan, Firas Saidani, Owen Van Esbroeck, Richard Khoury, Leila Kosseim · Oct 26, 2025 · Citations: 0
- REVISION: Reflective Intent Mining and Online Reasoning Auxiliary for E-commerce Visual Search System Optimization
Yiwen Tang, Qiuyu Zhao, Zenghui Sun, Jinsong Lan, Xiaoyong Zhu · Oct 26, 2025 · Citations: 0
Critique Edit
To alleviate this issue, we propose REVISION, a novel framework.
- Rule-Based Explanations for Retrieval-Augmented LLM Systems
Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta · Oct 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Towards Scalable Oversight via Partitioned Human Supervision
Ren Yin, Takashi Ishida, Masashi Sugiyama · Oct 26, 2025 · Citations: 0
As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging.
- VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang · Oct 25, 2025 · Citations: 0
To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality.
- WAON: Large-Scale Japanese Image-Text Pair Dataset for Improving Model Performance on Japanese Cultural Tasks
Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe · Oct 25, 2025 · Citations: 0
To improve the quality and reliability of evaluation on Japanese cultural tasks, we also construct WAON-Bench, a manually curated benchmark for Japanese cultural image classification comprising 374 classes, which addresses issues in the…
- From Slides to Chatbots: Enhancing Large Language Models with University Course Materials
Tu Anh Dinh, Philipp Nicolas Schumacher, Jan Niehues · Oct 25, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DETECT: Determining Ease and Textual Clarity of German Text Simplifications
Maria Korobeynikova, Alessia Battisti, Lukas Fischer, Yingqiang Gao · Oct 25, 2025 · Citations: 0
Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in terms of simplicity, meaning preservation, and…
- ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell · Oct 24, 2025 · Citations: 0
In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages.
- Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag · Oct 24, 2025 · Citations: 0
Pairwise Preference
Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking.
- VisCoder2: Building Multi-Language Visualization Coding Agents
Yuansheng Ni, Songcheng Cai, Xiangchao Chen, Jiarong Liang, Zhiheng Lyu · Oct 24, 2025 · Citations: 0
Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code.
- A Diagnostic Benchmark for Sweden-Related Factual Knowledge
Jenny Kunz · Oct 24, 2025 · Citations: 0
Many Swedish benchmarks are translations of US-centric benchmarks and are therefore not suitable for testing knowledge that is particularly relevant, or even specific, to Sweden.
- Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding
Anupam Pani, Yanchao Yang · Oct 24, 2025 · Citations: 0
- PARL: Prompt-based Agents for Reinforcement Learning
Yarik Menchaca Resendiz, Roman Klinger · Oct 24, 2025 · Citations: 0
However, limited work evaluates LLMs as agents in reinforcement learning (RL) tasks (e.g., playing games), where learning occurs through interaction with an environment and a reward system.
- Estonian Native Large Language Model Benchmark
Helena Grete Lillepalu, Tanel Alumäe · Oct 24, 2025 · Citations: 0
The availability of LLM benchmarks for the Estonian language is limited, and a comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet to be conducted.
- Designing and Evaluating Chain-of-Hints for Scientific Question Answering
Anubhav Jangra, Smaranda Muresan · Oct 24, 2025 · Citations: 0
Pairwise Preference
Using the best performing LLM as the backbone of a quantitative study with 41 participants, we uncover distinct user preferences across hinting strategies, and identify the limitations of automatic evaluation metrics to capture them.
- Support-Contra Asymmetry in LLM Explanations
Avinash Patil · Oct 23, 2025 · Citations: 0
Across three benchmark datasets (WIKIONTOLOGY, AG NEWS, and IMDB), we observe a consistent empirical pattern that we term support-contra asymmetry.
- Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Yuhan Liu, Lianhui Qin, Shengjie Wang · Oct 23, 2025 · Citations: 0
- Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People
Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum · Oct 23, 2025 · Citations: 0
Drawing on insights from human cognition, we develop methods to evaluate and enhance agentic information-seeking.
- Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems
Xi He, Sirui Lu, Bei Zeng · Oct 23, 2025 · Citations: 0
Multi Agent
We address this gap by extending TeXRA with an independent Lean 4 verification layer, turning it into a human-guided multi-agent platform for exact scientific discovery.
- Transferable Graph Learning for Transmission Congestion Management via Busbar Splitting
Ali Rajaei, Peter Palensky, Jochen L. Cremer · Oct 23, 2025 · Citations: 0
- Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups
Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi · Oct 23, 2025 · Citations: 0
Rubric Rating
Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters.
- GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning
Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia · Oct 23, 2025 · Citations: 0
Long Horizon
Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving…
- Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset
Paul Lerner, François Yvon · Oct 23, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025 · Citations: 0
Long Horizon
A Head Agent provides guidance that directs retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); finally, the Head Agent composes…
- Robust Preference Alignment via Directional Neighborhood Consensus
Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei · Oct 23, 2025 · Citations: 0
Pairwise Preference
To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method that leverages directional neighborhood consensus.
- Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda · Oct 23, 2025 · Citations: 0
- Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models
Matteo Silvestri, Fabiano Veglianti, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei · Oct 23, 2025 · Citations: 0
In contrast, we propose a framework for assessing contamination in tabular datasets by generating controlled queries and performing comparative evaluation.
- Citation Failure: Definition, Analysis and Efficient Mitigation
Jan Buchmann, Iryna Gurevych · Oct 23, 2025 · Citations: 0
- CreativityPrism: A Holistic Evaluation Framework for Large Language Model Creativity
Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei · Oct 23, 2025 · Citations: 0
Creativity is often seen as a hallmark of human intelligence.
- Communication to Completion: Modeling Collaborative Workflows with Intelligent Multi-Agent Communication
Yiming Lu, Xun Wang, Simin Ma, Shujian Liu, Sathish Reddy Indurthi · Oct 22, 2025 · Citations: 0
Multi Agent
Multi-agent LLM systems have demonstrated impressive capabilities in complex collaborative tasks, yet most frameworks treat communication as instantaneous and free, overlooking a fundamental constraint of real-world teamwork, collaboration…
- A Tutorial on Cognitive Biases in Agentic AI-Driven 6G Autonomous Networks
Hatim Chergui, Farhad Rezazadeh, Merouane Debbah, Christos Verikoukis · Oct 22, 2025 · Citations: 0
- Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu · Oct 22, 2025 · Citations: 0
- ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers
Saptarshi Sengupta, Zhengyu Zhou, Jun Araki, Xingbo Wang, Bingqing Wang · Oct 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
Xiang Liu, Xuming Hu, Xiaowen Chu, Eunsol Choi · Oct 22, 2025 · Citations: 0
- A Foundational Theory of Quantitative Abstraction: Adjunctions, Duality, and Logic for Probabilistic Systems
Nivar Anwer, Ezequiel López-Rubio, David Elizondo, Rafael M. Luque-Baena · Oct 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- LLM Unlearning with LLM Beliefs
Kemou Li, Qizhou Wang, Yue Wang, Fengpeng Li, Jun Liu · Oct 22, 2025 · Citations: 0
Extensive experiments across diverse benchmarks with various model families confirm the effectiveness of our approach.
- Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization
Yuto Tomikawa, Masaki Uto · Oct 22, 2025 · Citations: 0
Pairwise Preference
To address these limitations, this study proposes a novel difficulty-controllable multiple-choice question generation method for reading comprehension which leverages a large language model trained using a direct preference optimization…
- Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+
York Hay Ng, Aditya Khan, Xiang Lu, Matteo Salloum, Michael Zhou · Oct 22, 2025 · Citations: 0
Across multiple zero-shot transfer benchmarks, we demonstrate that our representations significantly improve transfer performance when the distance type is relevant to the task, while our composite distance yields gains in most tasks.
- A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
Sohyeon Jeon, Hyung-Chul Lee · Oct 22, 2025 · Citations: 0
Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge.
- PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis · Oct 21, 2025 · Citations: 0
Rubric Rating
In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g.
- Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li · Oct 21, 2025 · Citations: 0
Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions.
- LightMem: Lightweight and Efficient Memory-Augmented Generation
Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang · Oct 21, 2025 · Citations: 0
Tool Use
Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages.
- See the Text: From Tokenization to Visual Reading
Ling Xing, Rui Yan, Alex Jinpeng Wang, Zechao Li, Jinhui Tang · Oct 21, 2025 · Citations: 0
Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively.
- A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning
Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li · Oct 21, 2025 · Citations: 0
Across six math reasoning benchmarks, SePT improves over a strong no-training baseline, defined as the untuned base model evaluated at its best swept decoding temperature, on several tested models.
- Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun · Oct 21, 2025 · Citations: 0
- CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning
Masato Kikuchi, Masatsugu Ono, Toshioki Soga, Tetsu Tanabe, Tadachika Ozono · Oct 21, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers
Mohd Ruhul Ameen, Akif Islam, Farjana Aktar, M. Saifuzzaman Rafat · Oct 21, 2025 · Citations: 0
In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries.
- MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0
Demonstrations Long Horizon
Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
- Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
Yoshinari Fujinuma · Oct 21, 2025 · Citations: 0
One such challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores from a specified range without any references.
- Latent-Augmented Discrete Diffusion Models
Dario Shariatian, Alain Durmus, Umut Simsekli, Stefano Peluchetti · Oct 20, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Chain-of-Thought Reasoning Improves Context-Aware Translation with Large Language Models
Shabnam Ataee, Hugo Huart, Andrei Popescu-Belis · Oct 20, 2025 · Citations: 0
We use the English-French DiscEvalMT benchmark (Bawden et al., 2018) with pairs of sentences containing translation challenges for pronominal anaphora and lexical cohesion.
- SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025 · Citations: 0
Demonstrations Multi Agent
Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
- Is Multilingual LLM Watermarking Truly Multilingual? Scaling Robustness to 100+ Languages via Back-Translation
Asim Mohamed, Martin Gubri · Oct 20, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Trained Speech Foundational Model
Massa Baali, Rita Singh, Bhiksha Raj · Oct 20, 2025 · Citations: 0
DELULU significantly outperforms prior SSL models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks…
- Towards a Practical Understanding of Lagrangian Methods in Safe Reinforcement Learning
Lindsay Spoor, Álvaro Serra-Gómez, Aske Plaat, Thomas Moerland · Oct 20, 2025 · Citations: 0
Safe reinforcement learning addresses constrained optimization problems where maximizing performance must be balanced against safety constraints, and Lagrangian methods are a widely used approach for this purpose.
- Annotation-Efficient Universal Honesty Alignment
Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu · Oct 20, 2025 · Citations: 0
To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals.
- StreamingThinker: Large Language Models Can Think While Reading
Junlong Tong, Yingqi Fan, Anhao Zhao, Yunpu Ma, Xiaoyu Shen · Oct 20, 2025 · Citations: 0
Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete.