- SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi · Jun 30, 2025 · Citations: 0
Multi Agent
We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, generating an automatic curriculum of stronger opponents, and eliminating the need…
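The self-play curriculum described here can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): a learner plays a toy zero-sum game against frozen snapshots of its earlier selves, and the opponent pool grows as training proceeds. All names and the toy game (matching pennies) are illustrative assumptions.

```python
import copy
import random


class Policy:
    """Toy two-action policy; weights drift toward actions that won."""

    def __init__(self):
        self.weights = [1.0, 1.0]  # unnormalized preference per action

    def act(self, rng):
        total = sum(self.weights)
        return 0 if rng.random() < self.weights[0] / total else 1

    def update(self, action, reward, lr=0.1):
        # Clamp so a losing streak never zeroes out an action entirely.
        self.weights[action] = max(0.05, self.weights[action] + lr * reward)


def play_zero_sum(a, b, rng):
    """Matching pennies: player a gets +1 on a match, -1 otherwise."""
    move_a, move_b = a.act(rng), b.act(rng)
    rewards = (1, -1) if move_a == move_b else (-1, 1)
    return rewards, (move_a, move_b)


def self_play(steps=500, snapshot_every=100, seed=0):
    rng = random.Random(seed)
    learner = Policy()
    opponents = [copy.deepcopy(learner)]  # pool of frozen past selves
    for t in range(steps):
        opponent = rng.choice(opponents)  # sample from the curriculum
        (r_a, _), (m_a, _) = play_zero_sum(learner, opponent, rng)
        learner.update(m_a, r_a)
        if (t + 1) % snapshot_every == 0:
            # Freeze the current learner; the opponent pool strengthens
            # over time, yielding an automatic curriculum.
            opponents.append(copy.deepcopy(learner))
    return learner, opponents
```

The key design point mirrored from the abstract is that opponents are frozen copies, so the learner always trains against a stationary but progressively stronger distribution of adversaries.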
- TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation
Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun · Jun 30, 2025 · Citations: 0
Pairwise Preference
Conducting supervised and preference fine-tuning of large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values.
- Unveiling Decision-Making in LLMs for Text Classification: Extraction of influential and interpretable concepts with Sparse Autoencoders
Mathis Le Bail, Jérémie Dentan, Davide Buscaldi, Sonia Vanier · Jun 30, 2025 · Citations: 0
These concepts are linear combinations of neuron activations that correspond to human-interpretable features.
- Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective
Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou · Jun 30, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MindCube: Spatial Mental Modeling from Limited Views
Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang · Jun 26, 2025 · Citations: 0
Can Vision-Language Models (VLMs) imagine the full scene from just a few views, like humans do?
- Complexity-aware fine-tuning
Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev · Jun 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Multi-lingual Functional Evaluation for Large Language Models
Victor Ojewale, Inioluwa Deborah Raji, Suresh Venkatasubramanian · Jun 25, 2025 · Citations: 0
Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM.
- Cognitive models can reveal interpretable value trade-offs in language models
Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier · Jun 25, 2025 · Citations: 0
- π-CoT: Prolog-Initialized Chain-of-Thought Prompting for Multi-Hop Question-Answering
Chao Wan, Albert Gong, Mihir Mishra, Carl-Leander Henneking, Claas Beger · Jun 25, 2025 · Citations: 0
Extensive experiments demonstrate that π-CoT significantly outperforms standard RAG and in-context CoT on multi-hop question-answering benchmarks.
- An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu · Jun 25, 2025 · Citations: 0
Expert Verification Multi Agent
Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources.
- Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?
Chuxuan Hu, Yuxuan Zhu, Antony Kellermann, Caleb Biddulph, Suppakit Waiwitlikhit · Jun 24, 2025 · Citations: 0
- TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
Christoph Minixhofer, Ondrej Klejch, Peter Bell · Jun 24, 2025 · Citations: 0
Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive.
- LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li · Jun 23, 2025 · Citations: 0
Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and…
- Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel · Jun 23, 2025 · Citations: 0
Demonstrations
Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily "programmed" into LLMs through PBB, with important…
- Context Biasing for Pronunciation-Orthography Mismatch in Automatic Speech Recognition
Christian Huber, Alexander Waibel · Jun 23, 2025 · Citations: 0
- Parallel Continuous Chain-of-Thought with Jacobi Iteration
Haoyi Wu, Zhihao Teng, Kewei Tu · Jun 23, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models
Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang · Jun 23, 2025 · Citations: 0
This motivates us to explore if large reasoning models can benefit from a motivation of the task, i.e., awareness of the reward function, during the reinforcement finetuning process, as we humans sometimes do when learning.
- Improving Black-Box Generative Attacks via Generator Semantic Consistency
Jongoh Jeong, Hunmin Yang, Jaeseok Jeong, Kuk-Jin Yoon · Jun 23, 2025 · Citations: 0
- PDF Retrieval Augmented Question Answering
Thi Thu Uyen Hoang, Viet Anh Nguyen · Jun 22, 2025 · Citations: 0
We provide an in-depth experimental evaluation of our solution, demonstrating its capability to extract accurate information that can be applied to different types of content across PDFs.
- LLM Probability Concentration: How Alignment Shrinks the Generative Horizon
Chenghao Yang, Sida Li, Ari Holtzman · Jun 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Accelerating Residual Reinforcement Learning with Uncertainty Estimation
Lakshita Dodeja, Karl Schmeckpeper, Shivam Vats, Thomas Weng, Mingxi Jia · Jun 21, 2025 · Citations: 0
- Towards AI Search Paradigm
Yuchen Li, Hengyi Cai, Rui Kong, Xinran Chen, Jiamin Chen · Jun 20, 2025 · Citations: 0
In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making.
- PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025 · Citations: 0
We evaluate our system on three benchmarks: TriviaQA, HotpotQA, and DiaASQ, and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.
- Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning
Jiaqi Chen, Mingfeng Fan, Xuefeng Zhang, Jingsong Liang, Yuhong Cao · Jun 20, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries
Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto · Jun 20, 2025 · Citations: 0
Expert Verification
This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much…
- Long-Context Generalization with Sparse Attention
Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso · Jun 19, 2025 · Citations: 0
Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature α-entmax baselines, achieving up to 1000× length extrapolation on…
- A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives
Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He · Jun 19, 2025 · Citations: 0
Evaluations were heterogeneous: intrinsic metrics (27.1%), human-in-the-loop assessments (44.1%), and LLM-based evaluations (13.6%).
- Measuring Intent Comprehension in LLMs
Nadav Kunievsky, James A. Evans · Jun 19, 2025 · Citations: 0
People judge interactions with large language models (LLMs) as successful when outputs match what they want, not what they type.
- Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang · Jun 19, 2025 · Citations: 0
We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones.
- When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework
Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun · Jun 19, 2025 · Citations: 0
- OJBench: A Competition Level Code Benchmark For Large Language Models
Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao · Jun 19, 2025 · Citations: 0
- GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu · Jun 18, 2025 · Citations: 0
- SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych · Jun 18, 2025 · Citations: 0
Long Horizon
To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determining…
- DeVisE: Behavioral Testing of Medical Large Language Models
Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto · Jun 18, 2025 · Citations: 0
Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations.
- ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut · Jun 18, 2025 · Citations: 0
- Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan · Jun 18, 2025 · Citations: 0
Red Team
As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards.
- BMFM-RNA: whole-cell expression decoding improves transcriptomic foundation models
Michael M. Danziger, Bharath Dandala, Viatcheslav Gurev, Matthew Madgwick, Sivan Ravid · Jun 17, 2025 · Citations: 0
Models trained with these objectives achieve best overall performance across CZI benchmarks, on zero-shot batch integration and linear probing cell-type annotation.
- LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo · Jun 17, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- RedTopic: Toward Topic-Diverse Red Teaming of Large Language Models
Jiale Ding, Xiang Zheng, Yutao Wu, Cong Wang, Wei-Bin Lee · Jun 17, 2025 · Citations: 0
Red Team
It tests LLMs with adversarial prompts to uncover vulnerabilities and improve safety alignment.
- Hope Speech Detection in code-mixed Roman Urdu tweets: A Positive Turn in Natural Language Processing
Muhammad Ahmad, Muhammad Waqas, Ameer Hamza, Ildar Batyrshin, Grigori Sidorov · Jun 17, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · Jun 17, 2025 · Citations: 0
Long Horizon
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
- Instruction Following by Principled Boosting Attention of Large Language Models
Vitoria Guardieiro, Avishree Khare, Adam Stein, Eric Wong · Jun 16, 2025 · Citations: 0
Yet in practice these constraints can be violated under long contexts or when user-provided context conflicts with them, creating reliability and safety risks.
- Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning
David Bani-Harouni, Chantal Pellegrini, Ege Özsoy, Nassir Navab, Matthias Keicher · Jun 16, 2025 · Citations: 0
Expert Verification
In contrast to this, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis via repeatedly requesting and interpreting relevant tests.
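The hypothesis-driven loop this entry describes can be approximated by a small, hypothetical sketch (not LA-CDM itself, which uses an LLM to choose and interpret tests): maintain a belief over candidate diagnoses, incorporate one test result at a time via a Bayesian update, and stop once some hypothesis is confident enough. All diagnosis and test names below are illustrative.

```python
def update_belief(belief, likelihoods):
    """Bayesian update of P(diagnosis) given one test's result likelihoods."""
    posterior = {d: belief[d] * likelihoods.get(d, 1e-9) for d in belief}
    z = sum(posterior.values())
    return {d: p / z for d, p in posterior.items()}


def diagnose(prior, tests, threshold=0.9, max_tests=5):
    """Interpret test results one by one until a diagnosis crosses the
    confidence threshold; return (diagnosis, confidence, deciding test)."""
    belief = dict(prior)
    for name, likelihoods in list(tests.items())[:max_tests]:
        belief = update_belief(belief, likelihoods)
        top, p = max(belief.items(), key=lambda kv: kv[1])
        if p >= threshold:
            return top, p, name  # converged: stop requesting tests
    top, p = max(belief.items(), key=lambda kv: kv[1])
    return top, p, None  # budget exhausted without convergence
```

A real uncertainty-aware agent would additionally pick the *next* test by expected information gain rather than in fixed order; this sketch only captures the converge-by-repeated-testing structure.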
- DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation
Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang · Jun 16, 2025 · Citations: 0
Safety-aligned large language models (LLMs) remain vulnerable to backdoor attacks.
- Dynamic Reinsurance Treaty Bidding via Multi-Agent Reinforcement Learning
Stella C. Dong, James R. Finlay · Jun 16, 2025 · Citations: 0
Multi Agent
This paper develops a novel multi-agent reinforcement learning (MARL) framework for reinsurance treaty bidding, addressing long-standing inefficiencies in traditional broker-mediated placement processes.
- MemeMind: A Large-Scale Multimodal Dataset with Chain-of-Thought Reasoning for Harmful Meme Detection
Hexiang Gu, Qifan Yu, Yuan Liu, Zikang Li, Saihui Hou · Jun 15, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SPECS: Faster Test-Time Scaling through Speculative Drafts
Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer · Jun 15, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Maximilian Kreutner, Marlene Lutz, Markus Strohmaier · Jun 13, 2025 · Citations: 0
We evaluate whether predictions are stable in response to counterfactual arguments, different persona prompts, and generation methods.
- Differential Privacy in Machine Learning: A Survey from Symbolic AI to LLMs
Francisco Aguilera-Martínez, Fernando Berzal · Jun 13, 2025 · Citations: 0
- Are LLMs Good Text Diacritizers? An Arabic and Yoruba Case Study
Hawau Olamide Toyin, Samar Mohamed Magdy, Hanan Aldarmaki · Jun 13, 2025 · Citations: 0
- A Lightweight IDS for Early APT Detection Using a Novel Feature Selection Method
Bassam Noori Shaker, Bahaa Al-Musawi, Mohammed Falih Hassan · Jun 13, 2025 · Citations: 0
Our proposed method reduces the selected features of the SCVIC-APT-2021 dataset from 77 to just four while maintaining consistent evaluation metrics for the suggested system.
- DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents
Hao Li, Xiaogeng Liu, Hung-Chun Chiu, Dianqi Li, Ning Zhang · Jun 13, 2025 · Citations: 0
- Spurious Rewards: Rethinking Training Signals in RLVR
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang · Jun 12, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- VINCIE: Unlocking In-context Image Editing from Video
Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin · Jun 12, 2025 · Citations: 0
- Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles
Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Puyu Zeng, Yuxuan Wang · Jun 12, 2025 · Citations: 0
Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63× speedup on LLaDA with minimal accuracy drop, and up to 34.22× when combined with caching.
- Size-adaptive Hypothesis Testing for Fairness
Antonio Ferrara, Francesco Cozzi, Alan Perotti, André Panisson, Francesco Bonchi · Jun 12, 2025 · Citations: 0
We validate our approach empirically on benchmark datasets, demonstrating how our tests provide interpretable, statistically rigorous decisions under varying degrees of data availability and intersectionality.
- Probabilistic distances-based hallucination detection in LLMs with RAG
Rodion Oblovatny, Alexandra Kuleshova, Konstantin Polev, Alexey Zaytsev · Jun 11, 2025 · Citations: 0
Detecting hallucinations in large language models (LLMs) is critical for their safety in many applications.
- ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution
Gonçalo Hora de Carvalho, Lazar S. Popov, Sander Kaatee, Mário S. Correia, Kristinn R. Thórisson · Jun 11, 2025 · Citations: 0
We introduce ICE-ID, a benchmark dataset comprising 984,028 records from 16 Icelandic census waves spanning 220 years (1703–1920), with 226,864 expert-curated person identifiers.
- Query-Level Uncertainty in Large Language Models
Lihu Chen, Gerard de Melo, Fabian M. Suchanek, Gaël Varoquaux · Jun 11, 2025 · Citations: 0
- InstructPro: Natural Language Guided Ligand-Binding Protein Design
Zhenqiao Song, Ramith Hettiarachchi, Chuan Li, Jianwen Xie, Lei Li · Jun 11, 2025 · Citations: 0
To enable training and evaluation, we develop InstructProBench, a large-scale dataset of 9.6 million (function description, ligand, protein) triples.