- Position: Beyond Sensitive Attributes, ML Fairness Should Quantify Structural Injustice via Social Determinants
Zeyu Tang, Alex John London, Atoosa Kasirzadeh, Sarah Stewart de Ramirez, Peter Spirtes · Aug 10, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- From Product Hilbert Spaces to the Generalized Koopman Operator and the Nonlinear Fundamental Lemma
Mircea Lazar · Aug 10, 2025 · Citations: 0
- ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering
Shubhra Ghosh, Abhilekh Borah, Aditya Kumar Guru, Kripabandhu Ghosh · Aug 10, 2025 · Citations: 0
- SQL-Exchange: Transforming SQL Queries Across Domains
Mohammadreza Daviran, Brian Lin, Davood Rafiei · Aug 9, 2025 · Citations: 0
Our comprehensive evaluation across multiple model families and benchmark datasets -- assessing structural alignment with source queries, execution validity on target databases, and semantic correctness -- demonstrates that SQL-Exchange is…
- IntrinsicWeather: Controllable Weather Editing in Intrinsic Space
Yixin Zhu, Zuo-Liang Zhu, Jian Yang, Miloš Hašan, Jin Xie · Aug 9, 2025 · Citations: 0
- Seeing Through the Noise: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective
Maoxun Yuan, Duanni Meng, Ziteng Xi, Tianyi Zhao, Shiji Zhao · Aug 9, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection
Ziqi Liu, Ziyang Zhou, Yilin Li, Mingxuan Hu, Yushan Pan · Aug 9, 2025 · Citations: 0
Multi Agent
To address these challenges, we propose **SEVADE**, a novel **S**elf-**Ev**olving multi-agent **A**nalysis framework with **D**ecoupled **E**valuation for hallucination-resistant sarcasm detection.
- Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method
Masoumeh Sharafi, Soufiane Belharbi, Muhammad Osama Zeeshan, Houssem Ben Salem, Ali Etemad · Aug 8, 2025 · Citations: 0
Facial expression recognition (FER) models are widely used in video-based affective computing applications, such as human-computer interaction and healthcare monitoring.
- Memp: Exploring Agent Procedural Memory
Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao · Aug 8, 2025 · Citations: 0
Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters.
- EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation
Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Jialin Liu · Aug 8, 2025 · Citations: 0
Pairwise Preference · Multi Agent
Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation.
- Large language models show fragile cognitive reasoning about human emotions
Sree Bhattacharyya, Evgenii Kuriabov, Lucas Craig, Tharun Dilliraj, Reginald B. Adams · Aug 7, 2025 · Citations: 0
Affective computing seeks to support the holistic development of artificial intelligence by enabling machines to engage with human emotion.
- MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang · Aug 7, 2025 · Citations: 0
Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability.
- Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray · Aug 7, 2025 · Citations: 0
Long Horizon
Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time.
- LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Huayu Sha · Aug 7, 2025 · Citations: 0
To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs.
- Unsupervised Learning for Inverse Problems in Computed Tomography
Laura Hellwege, Johann Christopher Engster, Moritz Schaar, Thorsten M. Buzug, Maik Stille · Aug 7, 2025 · Citations: 0
- Not All Errors Are Created Equal: ASCoT Addresses Late-Stage Fragility in Efficient LLM Reasoning
Dongxu Zhang, Yujun Wu, Yiding Sun, Jinnan Yang, Ning Yang · Aug 7, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering
Zhuohang Jiang, Pangjing Wu, Xu Yuan, Wenqi Fan, Qing Li · Aug 7, 2025 · Citations: 0
By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning, enabling it to tackle complex VQA tasks effectively.
- Learning from Oblivion: Predicting Knowledge Overflowed Weights via Retrodiction of Forgetting
Jinhyeok Jang, Jaehong Kim, Jung Uk Kim · Aug 7, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- TURA: Tool-Augmented Unified Retrieval Agent for AI Search
Zhejun Zhao, Yuchen Li, Alley Liu, Yuehu Dong, Xiaolong Wei · Aug 6, 2025 · Citations: 0
Web Browsing
To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information.
- Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis · Aug 6, 2025 · Citations: 0
Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than GQA, low-rank baselines and recent Repeat-all-over/Sequential sharing at comparable parameter budgets.
- LayerT2V: A Unified Multi-Layer Video Generation Framework
Guangzhao Li, Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo · Aug 6, 2025 · Citations: 0
Long Horizon
Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows.
- STEMTOX: From Social Tags to Fine-Grained Toxic Meme Detection via Entropy-Guided Multi-Task Learning
Subhankar Swain, Naquee Rizwan, Vishwa Gangadhar S, Nayandeep Deb, Animesh Mukherjee · Aug 6, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CoAct-1: Computer-using Multi-Agent System with Coding Actions
Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi · Aug 5, 2025 · Citations: 0
Long Horizon
In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action.
- Hidden Dynamics of Massive Activations in Transformer Training
Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos · Aug 5, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Cropping outperforms dropout as an augmentation strategy for self-supervised training of text embeddings
Rita González-Márquez, Philipp Berens, Dmitry Kobak · Aug 5, 2025 · Citations: 0
- VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-language Models
Yufei Xue, Yushi Huang, Jiawei Shao, Lunjie Zhu, Chi Zhang · Aug 5, 2025 · Citations: 0
Extensive evaluations on 8 benchmarks across 0.5B to 32B VLMs demonstrate the state-of-the-art (SOTA) performance of our VLMQ, particularly under low-bit settings.
- RooseBERT: A New Deal For Political Language Modelling
Deborah Dore, Elena Cabrio, Serena Villata · Aug 5, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- When Algorithms Meet Artists: Semantic Compression of Artists' Concerns in the Public AI-Art Debate
Ariya Mukherjee-Gandhi, Oliver Muellerklein · Aug 5, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark
Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi · Aug 5, 2025 · Citations: 0
- PoeTone: A Framework for Constrained Generation of Structured Chinese Songci with LLMs
Zhan Qu, Shuzhou Yuan, Michael Färber · Aug 4, 2025 · Citations: 0
We first develop a comprehensive, multi-faceted evaluation framework that includes: (i) a formal conformity score, (ii) automated quality assessment using LLMs, (iii) human evaluation, and (iv) classification-based probing tasks.
- MolReasoner: Toward Effective and Interpretable Reasoning for Molecular LLMs
Guojiang Zhao, Zixiang Lu, Yutang Ge, Sihang Li, Zheng Cheng · Aug 4, 2025 · Citations: 0
Extensive evaluations demonstrate that MolReasoner significantly outperforms a wide range of strong baselines in both molecule generation and captioning tasks.
- Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models
Soyeon Kim, Jindong Wang, Xing Xie, Steven Euijong Whang · Aug 4, 2025 · Citations: 0
- SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents
Changhao Jiang, Jiajun Sun, Yifei Cao, Jiabao Zhuang, Xinmeng Che · Aug 4, 2025 · Citations: 0
Speech is essential for realistic role-playing, yet existing work on role-playing agents largely centers on text, leaving Speech Role-Playing Agents (SRPAs) underexplored and without systematic evaluation.
- Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
Xinting Huang, Michael Hahn · Aug 3, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MLP Memory: A Retriever-Pretrained Memory for Large Language Models
Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo · Aug 3, 2025 · Citations: 0
- LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?
Guozhao Mo, Wenliang Zhong, Jiawei Chen, Qianhao Yuan, Xuanang Chen · Aug 3, 2025 · Citations: 0
Tool Use
Unfortunately, there is still a large gap between real-world MCP usage and current evaluation: they typically assume single-server settings and directly inject tools into the model's context, bypassing the challenges of large-scale…
- A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents
Clayton Cohn, Surya Rayala, Namrata Srivastava, Joyce Horn Fonteles, Shruti Jain · Aug 2, 2025 · Citations: 0
- Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
Xinlin Zhuang, Feilong Tang, Haolin Yang, Xiwei Liu, Ming Hu · Aug 2, 2025 · Citations: 0
Furthermore, human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and…
- RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Interactive Environmental Learning in Physical Embodied Systems
Mingcong Lei, Honghao Cai, Yuyuan Yang, Yimou Wu, Jinke Ren · Aug 2, 2025 · Citations: 0
- NaturalGAIA: A Verifiable Benchmark and Hierarchical Framework for Long-Horizon GUI Tasks
Zihan Zheng, Tianle Cui, Taoran Wang, Fengtao Wang, Jiahui Pan · Aug 2, 2025 · Citations: 0
- WebDS: An End-to-End Benchmark for Web-based Data Science
Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota · Aug 2, 2025 · Citations: 0
Long Horizon
In response, we introduce WebDS, the first end-to-end web-based data science benchmark.
- GHTM: A Graph-based Hybrid Topic Modeling Approach with a Benchmark Dataset for the Low-Resource Bengali Language
Farhana Haque, Md. Abdur Rahman, Sumon Ahmed · Aug 1, 2025 · Citations: 0
Existing Bengali topic modeling research lacks standardized evaluation frameworks with comprehensive baselines and diverse datasets, exploration of modern methodological approaches, and reproducible implementations, with only three…
- Activation-Guided Local Editing for Jailbreaking Attacks
Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang · Aug 1, 2025 · Citations: 0
Red Team
Token-level jailbreak attacks often produce incoherent or unreadable inputs and exhibit poor transferability, while prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity.
- Acoustic Imaging for Low-SNR UAV Detection: Dense Beamformed Energy Maps and U-Net SELD
Belman Jahir Rodriguez, Sergio F. Chevtchenko, Marcelo Herrera Martinez, Yeshwant Bethy, Saeed Afshar · Aug 1, 2025 · Citations: 0
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
Yihong Dong, Xue Jiang, Yongding Tao, Huanyu Liu, Kechi Zhang · Jul 31, 2025 · Citations: 0
Compared with existing RLVR methods, RL-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; 3) consistent and significant gains across diverse…
- Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Loza Vera · Jul 31, 2025 · Citations: 0
Red Team
Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints.
- Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity
Xinwei Wu, Haojie Li, Hongyu Liu, Xinyu Ji, Ruohan Li · Jul 30, 2025 · Citations: 0
We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations.
- League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma · Jul 30, 2025 · Citations: 0
- Better Together: Cross and Joint Covariances Enhance Signal Detectability in Undersampled Data
Arabind Swain, Sean Alexander Ridout, Ilya Nemenman · Jul 29, 2025 · Citations: 0
- UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song · Jul 29, 2025 · Citations: 0
The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities.
- Who's important? -- SUnSET: Synergistic Understanding of Stakeholder, Events and Time for Timeline Generation
Tiviatis Sim, Kaiwen Yang, Shen Xin, Kenji Kawaguchi · Jul 29, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Soft Head Selection for Injecting ICL-Derived Task Embeddings
Jungwon Park, Jimyeong Kim, Changin Choi, Wonjong Rhee · Jul 28, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A survey of diversity quantification in natural language processing: The why, what, where and how
Louis Estève, Marie-Catherine de Marneffe, Nurit Melnik, Agata Savary, Olha Kanishcheva · Jul 28, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs
Xueyao Wan, Hang Yu · Jul 28, 2025 · Citations: 0
- Enhancing Jailbreak Attacks on LLMs via Persona Prompts
Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang · Jul 28, 2025 · Citations: 0