- MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding
Siddeshwar Raghavan, Tanwi Mallick · Oct 9, 2025 · Citations: 0
Multi Agent
We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks.
- How Reliable is Language Model Micro-Benchmarking?
Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta · Oct 9, 2025 · Citations: 0
Pairwise Preference
We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark.
- How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
Xianzhen Luo, Jinyang Huang, Wenzhen Zheng, Qingfu Zhu, Mingzheng Xu · Oct 9, 2025 · Citations: 0
Existing benchmarks often evaluate the exclusion ratio on large, unstructured collections of incorrect code, suffering from high computational costs and score inflation.
- Towards Unified World Models for Visual Navigation via Memory-Augmented Planning and Foresight
Yifei Dong, Fengyi Wu, Guangyu Chen, Lingdong Kong, Xu Zhu · Oct 9, 2025 · Citations: 0
Long Horizon
Enabling embodied agents to imagine future states is essential for robust and generalizable visual navigation.
- DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li · Oct 9, 2025 · Citations: 0
Our method features a specialized judge model trained with out-of-distribution data (AIME 2022, AIME 2023, and MATH 500) using oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072…
- If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models
Jasmin Orth, Philipp Mondorf, Barbara Plank · Oct 9, 2025 · Citations: 0
When humans evaluate how acceptable a conditional "If A, then B" is, their judgments are influenced by two main factors: the conditional probability of B given A, and the semantic relevance of the antecedent A given the consequent B (i.e.,…
- Augmenting Rating-Scale Measures with Text-Derived Items Using the Information-Determined Scoring (IDS) Framework
Joe Watson, Ivan O'Connor, Chia-Wen Chen, Luning Sun, Fang Luo · Oct 9, 2025 · Citations: 0
Rubric Rating
This marks a conceptual departure from traditional automated text scoring by prioritising information gain over fidelity to expert rubrics or human-annotated data.
- Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection
Yanran Chen, Lynn Greschner, Roman Klinger, Michael Klenk, Steffen Eger · Oct 9, 2025 · Citations: 0
We benchmark eight LLMs on injecting emotional appeal into fallacious arguments while preserving their logical structures, then use the best models to generate stimuli for a human study.
- Counterfactual Identifiability via Dynamic Optimal Transport
Fabio De Sousa Ribeiro, Ainkaran Santhirasekaram, Ben Glocker · Oct 9, 2025 · Citations: 0
- Neuron-Level Analysis of Cultural Understanding in Large Language Models
Taisei Yamamoto, Ryoma Kumon, Danushka Bollegala, Hitomi Yanaka · Oct 9, 2025 · Citations: 0
We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected.
- NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
Haolin Yang, Yuxing Long, Zhuoyuan Yu, Zihan Yang, Minghan Wang · Oct 9, 2025 · Citations: 0
Long Horizon
Prior benchmarks mainly focus on semantic understanding but overlook systematic evaluation of navigation agents' spatial perception and reasoning capabilities.
- Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability
Chengzhi Li, Heyan Huang, Ping Jian, Zhen Yang, Yaning Tian · Oct 9, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Lossless Vocabulary Reduction for Auto-Regressive Language Models
Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba · Oct 9, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
Shramay Palta, Peter Rankel, Sarah Wiegreffe, Rachel Rudinger · Oct 9, 2025 · Citations: 0
We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by…
- Fewer Weights, More Problems: A Practical Attack on LLM Pruning
Kazuki Egashira, Robin Staab, Thibaud Gloaguen, Mark Vero, Martin Vechev · Oct 9, 2025 · Citations: 0
Red Team
We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning methods in vLLM (Magnitude, Wanda, or SparseGPT) is applied, the model consistently exhibits strong malicious behaviors in a diverse set of…
- A Systematic Evaluation of Self-Supervised Learning for Label-Efficient Sleep Staging with Wearable EEG
Emilio Estevan, María Sierra-Torralba, Eduardo López-Larraz, Luis Montesano · Oct 9, 2025 · Citations: 0
- TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Leigang Qu, Ziyang Wang, Na Zheng, Wenjie Wang, Liqiang Nie · Oct 9, 2025 · Citations: 0
- ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall
Jiayu Yang, Yuxuan Fan, Songning Lai, Shengen Wu, Jiaqi Tang · Oct 9, 2025 · Citations: 0
- Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects
Verena Blaschke, Miriam Winkler, Barbara Plank · Oct 9, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- AdaSwitch: Balancing Exploration and Guidance in Knowledge Distillation via Adaptive Switching
Jingyu Peng, Maolin Wang, Hengyi Cai, Yuchen Li, Kai Zhang · Oct 9, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy
Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan · Oct 9, 2025 · Citations: 0
Red Team
Safety alignment of large language models currently faces a central challenge: existing alignment techniques often prioritize mitigating responses to harmful prompts at the cost of overcautious behavior, leading models to incorrectly…
- RCPU: Rotation-Constrained Error Compensation for Structured Pruning of Large Language Models
Shuichiro Haruta, Kazunori Matsumoto, Zhi Li, Yanan Wang, Mori Kurokawa · Oct 9, 2025 · Citations: 0