- DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models
Kaixuan Ren, Preslav Nakov, Usman Naseem · Oct 12, 2025 · Citations: 0
As vision-language models (VLMs) become increasingly capable, maintaining a balance between safety and usefulness remains a central challenge.
- Happiness is Sharing a Vocabulary: A Study of Transliteration Methods
Haeji Jung, Jinju Kim, Kyungjin Kim, Youjeong Roh, David R. Mortensen · Oct 12, 2025 · Citations: 0
We evaluate each model on three downstream tasks -- named entity recognition (NER), part-of-speech tagging (POS) and natural language inference (NLI) -- and find that romanization significantly outperforms other input types in 11 out of 12…
- FactAppeal: Identifying Epistemic Factual Appeals in News Media
Guy Mor-Lan, Tamir Sheafer, Shaul R. Shenhav · Oct 12, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Detecting Hallucinations in Authentic LLM-Human Interactions
Yujie Ren, Niklas Gruhlke, Anne Lauscher · Oct 12, 2025 · Citations: 0
To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions.
- Personalized Motion Guidance Framework for Athlete-Centric Coaching
Ryota Takamido, Chiharu Suzuki, Hiroki Nakamoto · Oct 12, 2025 · Citations: 0
- FML-bench: Benchmarking Machine Learning Agents for Scientific Research
Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen · Oct 12, 2025 · Citations: 0
To more comprehensively evaluate agents in scientific research settings, we introduce FML-bench, a benchmark comprising 8 diverse and fundamental ML research tasks, and further propose complementary metrics, notably Exploration Diversity,…
- CQA-Eval: Designing Reliable Evaluations of Multi-paragraph Clinical QA under Resource Constraints
Federica Bologna, Tiffany Pan, Matthew Wilkens, Yue Guo, Lucy Lu Wang · Oct 12, 2025 · Citations: 0
Evaluating multi-paragraph clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over multi-paragraph text is difficult.
- EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing
Sicheng Lyu, Yu Gu, Xinyu Wang, Jerry Huang, Sitao Luan · Oct 11, 2025 · Citations: 0
Evaluations on real-world sequential knowledge-editing benchmarks show that EvoEdit achieves performance better than or comparable to prior state-of-the-art locate-then-edit techniques, with up to a 3.53× speedup.
- Language steering in latent space to mitigate unintended code-switching
Andrey Goncharov, Nikolai Kondusov, Alexey Zaytsev · Oct 11, 2025 · Citations: 0
Generation-based evaluation on Llama-3.2 further demonstrates a 63–99% reduction in Code-Switching Index across four language pairs (p < 0.001).
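As context for this excerpt, a Code-Switching Index can be sketched as the fraction of adjacent tokens whose language labels differ. This is a minimal illustrative variant, not necessarily the exact metric used in the paper:

```python
def code_switching_index(lang_tags):
    """Illustrative Code-Switching Index: fraction of adjacent token pairs
    whose language tags differ (0 = monolingual; higher = more switching).
    `lang_tags`: per-token language labels, e.g. ["en", "en", "ru", "en"]."""
    if len(lang_tags) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(lang_tags, lang_tags[1:]))
    return switches / (len(lang_tags) - 1)
```

Under this definition, reducing unintended code-switching means driving the index of generated text toward that of monolingual references.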
- You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs
Yijie Xu, Huizai Yao, Zhiyu Guo, Pengteng Li, Aiwei Liu · Oct 11, 2025 · Citations: 0
Across diverse model architectures and domain-specific benchmarks, SyTTA delivers consistent gains.
- CLMN: Concept based Language Models via Neural Symbolic Reasoning
Yibo Yang · Oct 11, 2025 · Citations: 0
Concept bottleneck models tie predictions to human concepts in vision, but NLP versions either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept…
- Mapping Semantic & Syntactic Relationships with Geometric Rotation
Michael Freenor, Lauren Alvarez · Oct 10, 2025 · Citations: 0
Demonstrations
We introduce Rotor-Invariant Shift Estimation (RISE), a geometric approach that represents semantic-syntactic transformations as consistent rotational operations in embedding space, leveraging the manifold structure of modern language…
- GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data
Margarita Belova, Jiaxin Xiao, Shikhar Tuli, Niraj K. Jha · Oct 10, 2025 · Citations: 0
Expert Verification
GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines.
- The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf · Oct 10, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery
Antonio Martínez-Ibarra, Aurora González-Vidal, Adrián Cánovas-Rodríguez, Antonio F. Skarmeta · Oct 10, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ATLAS: Adaptive Trading with LLM AgentS Through Dynamic Prompt Optimization and Multi-Agent Coordination
Charidimos Papadakis, Angeliki Dimitriou, Giorgos Filandrianos, Maria Lymperaiou, Konstantinos Thomas · Oct 10, 2025 · Citations: 0
- Verifying Chain-of-Thought Reasoning via Its Computational Graph
Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda · Oct 10, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics
Jiapeng Wang, Changxin Tian, Kunlong Chen, Ziqi Liu, Jiaxin Mao · Oct 10, 2025 · Citations: 0
Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics.
- Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models
Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang · Oct 10, 2025 · Citations: 0
Critique Edit
Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs).
- DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao · Oct 10, 2025 · Citations: 0
Demonstrations
Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped.
- Clear Roads, Clear Vision: Advancements in Multi-Weather Restoration for Smart Transportation
Vijay M. Galshetwar, Praful Hambarde, Prashant W. Patil, Akshay Dudhane, Sachin Chaudhary · Oct 10, 2025 · Citations: 0
- Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang · Oct 10, 2025 · Citations: 0
To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection…
- A Linguistics-Aware LLM Watermarking via Syntactic Predictability
Shinwoo Park, Hyejin Park, Hyeseon Ahn, Yo-Sub Han · Oct 10, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Do LLMs Really Know What They Don't Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness
Chi Seng Cheang, Hou Pong Chan, Wenxuan Zhang, Yang Deng · Oct 10, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beyond Prefixes: Graph-as-Memory Cross-Attention for Knowledge Graph Completion with Large Language Models
Ruitong Liu, Boxu Lin, Peize Li, Siyuan Li, Yunjia Wu · Oct 10, 2025 · Citations: 0
- Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors
Xin Liu, Runsong Zhao, Pengcheng Huang, Xinyu Liu, Junyi Xiao · Oct 10, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao · Oct 10, 2025 · Citations: 0
We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings.
- MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding
Siddeshwar Raghavan, Tanwi Mallick · Oct 9, 2025 · Citations: 0
Multi Agent
We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks.
- How Reliable is Language Model Micro-Benchmarking?
Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta · Oct 9, 2025 · Citations: 0
Pairwise Preference
We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark.
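The meta-evaluation described in this excerpt can be sketched as pairwise ranking agreement bucketed by the full-benchmark performance gap. This is a hypothetical illustration (names, threshold, and bucketing are assumptions, not the paper's implementation):

```python
from itertools import combinations

def rank_agreement_by_gap(full_scores, micro_scores, gap_threshold=0.02):
    """Fraction of model pairs that a micro-benchmark ranks the same way as
    the full benchmark, split by whether the full-benchmark gap exceeds a
    threshold. `full_scores` / `micro_scores`: dicts of model -> accuracy."""
    buckets = {"small_gap": [], "large_gap": []}
    for a, b in combinations(full_scores, 2):
        gap = abs(full_scores[a] - full_scores[b])
        # Rankings agree when both score differences have the same sign.
        agree = (full_scores[a] - full_scores[b]) * (micro_scores[a] - micro_scores[b]) > 0
        key = "large_gap" if gap >= gap_threshold else "small_gap"
        buckets[key].append(agree)
    return {k: sum(v) / len(v) if v else None for k, v in buckets.items()}
```

The intuition this captures: a micro-benchmark may reliably separate models that differ substantially on the full benchmark while being unreliable for near-ties.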
- How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
Xianzhen Luo, Jinyang Huang, Wenzhen Zheng, Qingfu Zhu, Mingzheng Xu · Oct 9, 2025 · Citations: 0
Existing benchmarks often evaluate the exclusion ratio on large, unstructured collections of incorrect code, suffering from high computational cost and score inflation.
- Towards Unified World Models for Visual Navigation via Memory-Augmented Planning and Foresight
Yifei Dong, Fengyi Wu, Guangyu Chen, Lingdong Kong, Xu Zhu · Oct 9, 2025 · Citations: 0
Long Horizon
Enabling embodied agents to imagine future states is essential for robust and generalizable visual navigation.
- If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models
Jasmin Orth, Philipp Mondorf, Barbara Plank · Oct 9, 2025 · Citations: 0
When humans evaluate how acceptable a conditional "If A, then B" is, their judgments are influenced by two main factors: the conditional probability of B given A, and the semantic relevance of the antecedent A given the consequent B (i.e.,…
- Augmenting Rating-Scale Measures with Text-Derived Items Using the Information-Determined Scoring (IDS) Framework
Joe Watson, Ivan O'Connor, Chia-Wen Chen, Luning Sun, Fang Luo · Oct 9, 2025 · Citations: 0
Rubric Rating
This marks a conceptual departure from traditional automated text scoring by prioritising information gain over fidelity to expert rubrics or human-annotated data.
- Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection
Yanran Chen, Lynn Greschner, Roman Klinger, Michael Klenk, Steffen Eger · Oct 9, 2025 · Citations: 0
We benchmark eight LLMs on injecting emotional appeal into fallacious arguments while preserving their logical structures, then use the best models to generate stimuli for a human study.
- Counterfactual Identifiability via Dynamic Optimal Transport
Fabio De Sousa Ribeiro, Ainkaran Santhirasekaram, Ben Glocker · Oct 9, 2025 · Citations: 0
- Neuron-Level Analysis of Cultural Understanding in Large Language Models
Taisei Yamamoto, Ryoma Kumon, Danushka Bollegala, Hitomi Yanaka · Oct 9, 2025 · Citations: 0
We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected.
- NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
Haolin Yang, Yuxing Long, Zhuoyuan Yu, Zihan Yang, Minghan Wang · Oct 9, 2025 · Citations: 0
Long Horizon
Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities.
- Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability
Chengzhi Li, Heyan Huang, Ping Jian, Zhen Yang, Yaning Tian · Oct 9, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Lossless Vocabulary Reduction for Auto-Regressive Language Models
Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba · Oct 9, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
Shramay Palta, Peter Rankel, Sarah Wiegreffe, Rachel Rudinger · Oct 9, 2025 · Citations: 0
We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by…
- Fewer Weights, More Problems: A Practical Attack on LLM Pruning
Kazuki Egashira, Robin Staab, Thibaud Gloaguen, Mark Vero, Martin Vechev · Oct 9, 2025 · Citations: 0
Red Team
We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning methods in vLLM (Magnitude, Wanda, or SparseGPT) is applied, the model consistently exhibits strong malicious behaviors in a diverse set of…
- A Systematic Evaluation of Self-Supervised Learning for Label-Efficient Sleep Staging with Wearable EEG
Emilio Estevan, María Sierra-Torralba, Eduardo López-Larraz, Luis Montesano · Oct 9, 2025 · Citations: 0
- TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Leigang Qu, Ziyang Wang, Na Zheng, Wenjie Wang, Liqiang Nie · Oct 9, 2025 · Citations: 0
- ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall
Jiayu Yang, Yuxuan Fan, Songning Lai, Shengen Wu, Jiaqi Tang · Oct 9, 2025 · Citations: 0
- AdaSwitch: Balancing Exploration and Guidance in Knowledge Distillation via Adaptive Switching
Jingyu Peng, Maolin Wang, Hengyi Cai, Yuchen Li, Kai Zhang · Oct 9, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy
Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan · Oct 9, 2025 · Citations: 0
Red Team
Safety alignment of large language models currently faces a central challenge: existing alignment techniques often prioritize suppressing responses to harmful prompts at the cost of overcautious behavior, leading models to incorrectly…
- RCPU: Rotation-Constrained Error Compensation for Structured Pruning of Large Language Models
Shuichiro Haruta, Kazunori Matsumoto, Zhi Li, Yanan Wang, Mori Kurokawa · Oct 9, 2025 · Citations: 0
- PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing
Anthony Hughes, Vasisht Duddu, N. Asokan, Nikolaos Aletras, Ning Ma · Oct 8, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding
Zhivar Sourati, Zheng Wang, Marianne Menglin Liu, Yazhe Hu, Mengqing Guo · Oct 8, 2025 · Citations: 0
- EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science
Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park · Oct 8, 2025 · Citations: 0
To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals.
- Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible
Imry Ziv, Nur Lan, Emmanuel Chemla · Oct 8, 2025 · Citations: 0
Are large language models (LLMs) sensitive to the distinction between humanly possible and impossible languages?
- Search-R3: Unifying Reasoning and Embedding in Large Language Models
Yuntao Gui, James Cheng · Oct 8, 2025 · Citations: 0
Our extensive evaluations on diverse benchmarks demonstrate that Search-R3 significantly outperforms prior methods by unifying the reasoning and embedding generation processes.
- Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation
Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Rao Koluguri · Oct 8, 2025 · Citations: 0
We present the Open ASR Leaderboard, a reproducible benchmarking platform with community contributions from academia and industry.
- Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Recovery
Didrik Bergström, Deniz Gündüz, Onur Günlü · Oct 8, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Exposing Citation Vulnerabilities in Generative Engines
Riku Mochizuki, Shusuke Komatsu, Souta Noguchi, Kazuto Ataka · Oct 8, 2025 · Citations: 0
- FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline
Haotian Wu, Shufan Jiang, Chios Chen, Yiyang Feng, Hehai Lin · Oct 8, 2025 · Citations: 0
Multi Agent
As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios.
- PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs
Manuel Frank, Haithem Afli · Oct 8, 2025 · Citations: 0
- PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi · Oct 8, 2025 · Citations: 0
Pairwise Preference
Despite the dataset's small size, a Llama-3-8B-Base model fine-tuned on PiKa-SFT even outperforms the official Llama-3-8B-Instruct model, which was trained on over 10M proprietary examples, on widely used benchmarks such as AlpacaEval 2.0 and Arena-Hard.
- StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
Zhihao Wen, Wenkang Wei, Yuan Fang, Xingtong Yu, Hui Zhang · Oct 8, 2025 · Citations: 0
- Protecting De-identified Documents from Search-based Linkage Attacks
Pierre Lison, Mark Anderson · Oct 7, 2025 · Citations: 0