- NeuCo-Bench: A Novel Benchmark Framework for Neural Embeddings in Earth Observation
Rikard Vinge, Isabelle Wittmann, Jannik Schneider, Michael Marszalek, Luis Gilch · Oct 19, 2025 · Citations: 0
- CoGate-LSTM: Prototype-Guided Feature-Space Gating for Mitigating Gradient Dilution in Imbalanced Toxic Comment Classification
Noor Islam S. Mohammad · Oct 19, 2025 · Citations: 0
On the Jigsaw Toxic Comment benchmark, CoGate-LSTM achieves 0.881 macro-F1 (95% CI: [0.873, 0.889]) and 96.0% accuracy, outperforming fine-tuned BERT by 6.9 macro-F1 points (p < 0.001) and XGBoost by 4.7, while using only 7.3M parameters…
- SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models
Chih-Kai Yang, Yen-Ting Piao, Tzu-Wen Hsu, Szu-Wei Fu, Zhehuai Chen · Oct 19, 2025 · Citations: 0
We introduce SAKE, the first benchmark for editing perceptual auditory attribute knowledge in large audio-language models (LALMs), which requires modifying acoustic generalization rather than isolated facts.
- MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization
Wonduk Seo, Juhyeon Lee, Junseo Koh, Wonseok Choi, Hyunjin An · Oct 18, 2025 · Citations: 0
Critique Edit · Multi Agent
However, most existing frameworks treat evaluation as a black box, relying solely on outcome scores without explaining why prompts succeed or fail.
- Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods
Avrim Blum, Daniel Hsu, Cyrus Rashtchian, Donya Saless · Oct 18, 2025 · Citations: 0
Tool Use
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni · Oct 18, 2025 · Citations: 0
Human communication heavily relies on laconism and inferential pragmatics, allowing listeners to successfully reconstruct rich meaning from sparse, telegraphic speech.
- ScholarEval: Research Idea Evaluation Grounded in Literature
Hanane Nour Moussa, Patrick Queiroz Da Silva, Daniel Adu-Ampratwum, Alyson East, Zitong Lu · Oct 17, 2025 · Citations: 0
Rubric Rating
As AI tools become increasingly common for research ideation, robust evaluation is critical to ensure the validity and usefulness of generated ideas.
- SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection
Yang Feng, Xudong Pan · Oct 17, 2025 · Citations: 0
- In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions
Aria Pessianzadeh, Naima Sultana, Hildegarde Van den Bulck, David Gefen, Shahin Jabbari · Oct 17, 2025 · Citations: 0
The rise of generative AI (GenAI) has impacted many aspects of human life.
- PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction
Simon Yu, Gang Li, Weiyan Shi, Peng Qi · Oct 17, 2025 · Citations: 0
- BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance
Elias Hossain, Mehrdad Shoeibi, Ivan Garibay, Niloofar Yousefi · Oct 17, 2025 · Citations: 0
Multi Agent
We present BIOGEN, an evidence-grounded multi-agent framework for post hoc interpretation of RNA-seq transcriptional modules.
- HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination
Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He · Oct 17, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Language Models are Injective and Hence Invertible
Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis · Oct 17, 2025 · Citations: 0
- OffSim: Offline Simulator for Model-based Offline Inverse Reinforcement Learning
Woo-Jin Ahn, Sang-Ryul Baek, Yong-Jun Lee, Hyun-Duck Choi, Myo-Taeg Lim · Oct 17, 2025 · Citations: 0
- Learning to Answer from Correct Demonstrations
Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma · Oct 17, 2025 · Citations: 0
Demonstrations
We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time.
- MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics
Qinxuan Wang, Chuang Wang, Mingyu Zhang, Jingwei Sun, Peipei Yang · Oct 17, 2025 · Citations: 0
We evaluate MNO on diverse benchmarks, covering steady-state and unsteady flow scenarios with up to 300k points.
- When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
Heecheol Yun, Kwangmin Ki, Junghyun Lee, Eunho Yang · Oct 17, 2025 · Citations: 0
Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
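The selective-ensembling idea can be sketched with an illustrative stand-in criterion: ensemble only at tokens where the primary model is uncertain, and decode from it alone elsewhere. The entropy trigger and the simple distribution averaging below are assumptions for illustration, not SAFE's actual stability test.

```python
import math

# Hedged sketch of selective token-level ensembling: ensemble only where the
# primary model is uncertain, otherwise decode from it alone. The entropy
# trigger is an illustrative stand-in; SAFE's actual criterion differs.
def entropy(dist):
    """Shannon entropy (nats) of a token distribution given as {token: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def next_token(primary, others, threshold=0.5):
    if entropy(primary) <= threshold:  # confident: skip the ensemble entirely
        return max(primary, key=primary.get)
    # uncertain: average the distributions across all models, then decode
    vocab = set(primary) | {t for d in others for t in d}
    avg = {t: (primary.get(t, 0.0) + sum(d.get(t, 0.0) for d in others))
              / (1 + len(others)) for t in vocab}
    return max(avg, key=avg.get)

confident = {"4": 0.95, "5": 0.05}
uncertain = {"4": 0.40, "5": 0.35, "6": 0.25}
peer      = {"5": 0.90, "4": 0.10}
# The confident case decodes "4" without consulting the peer model;
# the uncertain case averages with the peer and flips to "5".
```

Skipping the ensemble at confident tokens is what makes gains possible while ensembling under 1% of tokens: most decoding steps never pay the cost of querying the other models.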
- AI-BAAM: AI-Driven Bank Statement Analytics as Alternative Data for Malaysian MSME Credit Scoring
Chun Chet Ng, Zhen Hao Chu, Jia Yu Lim, Yin Yin Boon, Wei Zeng Low · Oct 17, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Three-dimensional inversion of gravity data using implicit neural representations and scientific machine learning
Pankaj K Mishra, Sanni Laaksonen, Jochen Kamm, Anand Singh · Oct 17, 2025 · Citations: 0
- SAG-Agent: Enabling Long-Horizon Reasoning in Strategy Games via Dynamic Knowledge Graphs
Chenwei Tang, Lin Long, Xinyu Liu, Jingyu Xing, Zizhou Wang · Oct 17, 2025 · Citations: 0
- GUIrilla: A Scalable Framework for Automated Desktop UI Exploration
Sofiya Garkot, Maksym Shamrai, Ivan Synytsia, Mariya Hirna · Oct 16, 2025 · Citations: 0
- Composition-Grounded Data Synthesis for Visual Reasoning
Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li · Oct 16, 2025 · Citations: 0
- Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao · Oct 16, 2025 · Citations: 0
Tool Use
In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training.
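One plausible reading of an "information gain" turn-level reward can be sketched as the marginal increase in the policy's probability of the ground-truth answer after each turn. This is a hedged interpretation, not IGPO's implementation; `answer_probs` is an illustrative stand-in for those per-turn probabilities.

```python
# Hedged sketch (not IGPO's actual code): dense per-turn rewards for a
# multi-turn search agent. answer_probs[t] stands in for the policy's
# probability of the ground-truth answer after turn t (index 0 = before
# any search turn); the name is illustrative.
def turn_rewards(answer_probs):
    """Reward each turn by its marginal gain in gold-answer probability."""
    return [curr - prev for prev, curr in zip(answer_probs, answer_probs[1:])]

# Each turn that surfaces useful evidence earns a positive reward, and the
# per-turn rewards telescope to the total probability gain over the episode.
print(turn_rewards([0.0, 0.25, 0.5, 1.0]))  # [0.25, 0.25, 0.5]
```

Because the rewards telescope, maximizing their sum is consistent with the sparse outcome objective while still giving the agent a signal at every turn.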
- CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions
Lizhi Yang, Blake Werner, Massimiliano de Sa, Aaron D. Ames · Oct 16, 2025 · Citations: 0
Web Browsing
Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety.
- DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
Yu Zhou, Sohyun An, Haikang Deng, Da Yin, Clark Peng · Oct 16, 2025 · Citations: 0
In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects.
- Circuit Insights: Towards Interpretability Beyond Activations
Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek · Oct 16, 2025 · Citations: 0
- TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG
Annisaa Fitri Nurfidausi, Eleonora Mancini, Paolo Torroni · Oct 16, 2025 · Citations: 0
However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols.
- Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
Soorya Ram Shimgekar, Ruining Zhao, Agam Goyal, Violeta J. Rodriguez, Paul A. Bloom · Oct 16, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries
Divyat Mahajan, Sachin Goyal, Badr Youbi Idrissi, Mohammad Pezeshki, Ioannis Mitliagkas · Oct 16, 2025 · Citations: 0
- Telling Speculative Stories to Help Humans Imagine the Harms of Healthcare AI
Xingmeng Zhao, Tongnian Wang, Dan Schumacher, Veronica Rammouz, Anthony Rios · Oct 16, 2025 · Citations: 0
Multi Agent
Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect.
- LUMI: Unsupervised Intent Clustering with Multiple Pseudo-Labels
I-Fan Lin, Faegheh Hasibi, Suzan Verberne · Oct 16, 2025 · Citations: 0
Our evaluation on four benchmark sets shows that our approach achieves competitive results, better than recent state-of-the-art baselines, while avoiding the need to estimate the number of clusters during embedding refinement, as is…
- E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task
Jingyao Liu, Chen Huang, Zhizhao Guan, Wenqiang Lei, Yang Deng · Oct 16, 2025 · Citations: 0
Multi Agent
However, existing E2ESD benchmarks are limited by coarse-grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework capabilities.
- PluriHopRAG: Exhaustive, Recall-Sensitive QA Through Corpus-Specific Document Structure Learning
Mykolas Sveistrys, Richard Kunert · Oct 16, 2025 · Citations: 0
To study this setting, we introduce PluriHopWIND, a multilingual diagnostic benchmark of 48 pluri-hop questions over 191 real wind-industry reports, with high repetitiveness to reflect the challenge of distractors in real-world datasets.
- From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program
Joseph E. Trujillo-Falcon, Monica L. Bozeman, Liam E. Llewellyn, Samuel T. Halvorson, Meryl Mizell · Oct 16, 2025 · Citations: 0
We also integrated ethical AI practices throughout the program's design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public.
- Understanding the Ability of LLMs to Handle Character-Level Perturbation
Anyuan Zhuo, Xuefei Ning, Ningyuan Li, Jingyi Zhu, Yu Wang · Oct 16, 2025 · Citations: 0
Surprisingly, even under severe perturbation, such as shuffling nearly all words character-wise to produce text that is almost unreadable to humans, or inserting invisible characters which are several times more than the visible ones as…
- CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization
Henrique Assumpção, Diego Ferreira, Leandro Campos, Fabricio Murai · Oct 15, 2025 · Citations: 0
We evaluate CodeEvolve on benchmarks used to assess Google DeepMind's AlphaEvolve, and include direct comparisons with popular open-source frameworks for algorithmic discovery and heuristic design.
- REAP the Experts: Why Pruning Prevails for One-Shot MoE compression
Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou · Oct 15, 2025 · Citations: 0
- Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon · Oct 15, 2025 · Citations: 0
Pairwise Preference
In blind pairwise evaluations by 28 MFA-trained readers and 516 college-educated general readers, AI text from in-context prompting was strongly disfavored by MFA readers for stylistic fidelity (OR=0.16) and quality (OR=0.13), while general…
- Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Marián Šimko · Oct 15, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DeDelayed: Deleting Remote Inference Delay via On-Device Correction
Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja J. Yadwadkar · Oct 15, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion
Minjung Shin, Hyunin Cho, Sooyeon Go, Jin-Hwa Kim, Youngjung Uh · Oct 15, 2025 · Citations: 0
- Closing the Gap Between Text and Speech Understanding in LLMs
Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu · Oct 15, 2025 · Citations: 0
Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech…
- MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning
Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan · Oct 15, 2025 · Citations: 0
Comprehensive experiments on multiple temporal QA benchmarks show that MemoTime achieves overall state-of-the-art results, outperforming the strong baseline by up to 24.0%.
- Assessing LLM Reasoning Through Implicit Causal Chain Discovery in Climate Discourse
Liesbeth Allein, Nataly Pineda-Castañeda, Andrea Rocci, Marie-Francine Moens · Oct 15, 2025 · Citations: 0
In a diagnostic evaluation framework, we instruct nine LLMs to generate all possible intermediate causal steps linking given cause-effect pairs in causal chain structures.
- Embedding-Based Context-Aware Reranker
Ye Yuan, Mohammad Amin Shabani, Siqi Liu · Oct 15, 2025 · Citations: 0
We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.
- Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models
Yizhou Peng, Yukun Ma, Chong Zhang, Yi-Wen Chao, Chongjia Ni · Oct 15, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe · Oct 15, 2025 · Citations: 0
Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs.
- Putting on the Thinking Hats: A Survey on Chain of Thought Fine-tuning from the Perspective of Human Reasoning Mechanism
Xiaoshu Chen, Sihang Zhou, Ke Liang, Duanyang Yuan, Haoyuan Chen · Oct 15, 2025 · Citations: 0
It leverages both supervised and reinforced fine-tuning to cultivate human-like reasoning skills in LLMs, including detailed planning, divergent thinking, intuitive judgment, timely reflection, internal thinking, and fact perception.
- On the Reasoning Abilities of Masked Diffusion Language Models
Anej Svete, Ashish Sabharwal · Oct 15, 2025 · Citations: 0
- LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization
Yuanchen Wu, Saurabh Verma, Justin Lee, Fangzhou Xiong, Poppy Zhang · Oct 14, 2025 · Citations: 0
Pairwise Preference
We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization based on pairwise preference feedback from an LLM judge.
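The core loop of label-free selection from pairwise judgments can be sketched as a simple random tournament: duel prompt pairs, credit the judged winner, and keep the prompt with the most wins. This is a hedged illustration; `judge`, `quality`, and the uniform pairing are assumptions, and PDO's dueling-bandit machinery is more sample-efficient than this.

```python
import random
from collections import defaultdict

# Hedged sketch of label-free prompt selection from pairwise preferences.
# judge(a, b) stands in for an LLM judge returning the preferred prompt;
# here it is a toy oracle with hidden quality scores.
def select_prompt(prompts, judge, duels=200, seed=0):
    rng = random.Random(seed)
    wins = defaultdict(int)
    for _ in range(duels):
        a, b = rng.sample(prompts, 2)  # draw a random pair to duel
        wins[judge(a, b)] += 1         # credit the judged winner
    return max(prompts, key=lambda p: wins[p])

quality = {"terse": 0.3, "cot": 0.9, "persona": 0.6}  # hidden from the selector
def toy_judge(a, b):
    return a if quality[a] >= quality[b] else b

# "cot" wins every duel it enters, so it accumulates the most wins.
best = select_prompt(list(quality), toy_judge)
```

Note that no gold labels appear anywhere in the loop: the only supervision is which of two prompts the judge prefers, which is what makes the setup label-free.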
- Schema for In-Context Learning
Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung · Oct 14, 2025 · Citations: 0
Demonstrations
Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce Schema-Activated In-Context…
- Reveal-to-Revise: Explainable Bias-Aware Generative Modeling with Multimodal Attention
Noor Islam S. Mohammad, Md Muntaqim Meherab · Oct 14, 2025 · Citations: 0
- Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes · Oct 14, 2025 · Citations: 0
We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain.
- Toward LLM-Supported Automated Assessment of Critical Thinking Subskills
Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg · Oct 14, 2025 · Citations: 0
Rubric Rating
As the world becomes increasingly saturated with AI-generated content, disinformation, and algorithmic persuasion, critical thinking - the capacity to evaluate evidence, detect unreliable claims, and exercise independent judgment - is…
- Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
Ziyang Ma, Ruiyang Xu, Zhenghao Xing, Yunfei Chu, Yuxuan Wang · Oct 14, 2025 · Citations: 0
- When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection
Lang Gao, Xuhui Li, Chenxi Wang, Mingzhe Li, Wei Liu · Oct 14, 2025 · Citations: 0
In this paper, we introduce \dataset, the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations.
- Community size rather than grammatical complexity better predicts Large Language Model accuracy in a novel Wug Test
Nikoleta Pantelidou, Evelina Leivada, Raquel Montero, Paolo Morosi · Oct 14, 2025 · Citations: 0
The aim is to determine whether model accuracy approximates human competence and whether it is shaped primarily by linguistic complexity or by the size of the linguistic community, which affects the quantity of available training data.
- PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation
Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu · Oct 14, 2025 · Citations: 0
Long Horizon
Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining…
- An Order-Sensitive Conflict Measure for Random Permutation Sets
Ruolan Cheng, Yong Deng · Oct 14, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents
Dongsen Zhang, Zekun Li, Xu Luo, Xuannan Liu, Peipei Li · Oct 14, 2025 · Citations: 0