- Theory of Mind in Action: The Instruction Inference Task in Dynamic Human-Agent Collaboration
Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh · Jun 26, 2025 · Citations: 0
- MindCube: Spatial Mental Modeling from Limited Views
Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang · Jun 26, 2025 · Citations: 0
Can Vision-Language Models (VLMs) imagine the full scene from just a few views, like humans do?
- Complexity-aware fine-tuning
Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev · Jun 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Multi-lingual Functional Evaluation for Large Language Models
Victor Ojewale, Inioluwa Deborah Raji, Suresh Venkatasubramanian · Jun 25, 2025 · Citations: 0
Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM.
- Cognitive models can reveal interpretable value trade-offs in language models
Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier · Jun 25, 2025 · Citations: 0
- $π$-CoT: Prolog-Initialized Chain-of-Thought Prompting for Multi-Hop Question-Answering
Chao Wan, Albert Gong, Mihir Mishra, Carl-Leander Henneking, Claas Beger · Jun 25, 2025 · Citations: 0
Extensive experiments demonstrate that π-CoT significantly outperforms standard RAG and in-context CoT on multi-hop question-answering benchmarks.
- An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu · Jun 25, 2025 · Citations: 0
Expert Verification Multi Agent
Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources.
- Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder
Yingji Zhang, Danilo S. Carvalho, André Freitas · Jun 25, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
Baochang Ren, Shuofei Qiao, Da Zheng, Huajun Chen, Ningyu Zhang · Jun 24, 2025 · Citations: 0
Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning…
- Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?
Chuxuan Hu, Yuxuan Zhu, Antony Kellermann, Caleb Biddulph, Suppakit Waiwitlikhit · Jun 24, 2025 · Citations: 0
- TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
Christoph Minixhofer, Ondrej Klejch, Peter Bell · Jun 24, 2025 · Citations: 0
Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive.
- LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li · Jun 23, 2025 · Citations: 0
Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and…
- Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel · Jun 23, 2025 · Citations: 0
Demonstrations
Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily `programmed' into LLMs through PBB, with important…
- Context Biasing for Pronunciation-Orthography Mismatch in Automatic Speech Recognition
Christian Huber, Alexander Waibel · Jun 23, 2025 · Citations: 0
- Parallel Continuous Chain-of-Thought with Jacobi Iteration
Haoyi Wu, Zhihao Teng, Kewei Tu · Jun 23, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models
Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang · Jun 23, 2025 · Citations: 0
This motivates us to explore if large reasoning models can benefit from a motivation of the task, i.e., awareness of the reward function, during the reinforcement finetuning process, as we humans sometimes do when learning.
- Improving Black-Box Generative Attacks via Generator Semantic Consistency
Jongoh Jeong, Hunmin Yang, Jaeseok Jeong, Kuk-Jin Yoon · Jun 23, 2025 · Citations: 0
- PDF Retrieval Augmented Question Answering
Thi Thu Uyen Hoang, Viet Anh Nguyen · Jun 22, 2025 · Citations: 0
We provide an in-depth experimental evaluation of our solution, demonstrating its capability to extract accurate information that can be applied to different types of content across PDFs.
- LLM Probability Concentration: How Alignment Shrinks the Generative Horizon
Chenghao Yang, Sida Li, Ari Holtzman · Jun 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Accelerating Residual Reinforcement Learning with Uncertainty Estimation
Lakshita Dodeja, Karl Schmeckpeper, Shivam Vats, Thomas Weng, Mingxi Jia · Jun 21, 2025 · Citations: 0
- Towards AI Search Paradigm
Yuchen Li, Hengyi Cai, Rui Kong, Xinran Chen, Jiamin Chen · Jun 20, 2025 · Citations: 0
In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making.
- PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025 · Citations: 0
We evaluate our system on three benchmarks: TriviaQA, HotpotQA, DiaASQ and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.
- Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning
Jiaqi Chen, Mingfeng Fan, Xuefeng Zhang, Jingsong Liang, Yuhong Cao · Jun 20, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries
Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto · Jun 20, 2025 · Citations: 0
Expert Verification
This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much…
- Long-Context Generalization with Sparse Attention
Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso · Jun 19, 2025 · Citations: 0
Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature α-entmax baselines, achieving up to 1000\times length extrapolation on…
- A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives
Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He · Jun 19, 2025 · Citations: 0
Evaluations were heterogeneous: intrinsic metrics (27.1\%), human-in-the-loop assessments (44.1\%), and LLM-based evaluations (13.6\%).
- Measuring Intent Comprehension in LLMs
Nadav Kunievsky, James A. Evans · Jun 19, 2025 · Citations: 0
People judge interactions with large language models (LLMs) as successful when outputs match what they want, not what they type.
- Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang · Jun 19, 2025 · Citations: 0
We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones.
- When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework
Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun · Jun 19, 2025 · Citations: 0
- OJBench: A Competition Level Code Benchmark For Large Language Models
Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao · Jun 19, 2025 · Citations: 0
- GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu · Jun 18, 2025 · Citations: 0
- SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych · Jun 18, 2025 · Citations: 0
Long Horizon
To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determine…
- DeVisE: Behavioral Testing of Medical Large Language Models
Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto · Jun 18, 2025 · Citations: 0
Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations.
- ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut · Jun 18, 2025 · Citations: 0
- Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan · Jun 18, 2025 · Citations: 0
Red Team
As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards.
- BMFM-RNA: whole-cell expression decoding improves transcriptomic foundation models
Michael M. Danziger, Bharath Dandala, Viatcheslav Gurev, Matthew Madgwick, Sivan Ravid · Jun 17, 2025 · Citations: 0
Models trained with these objectives achieve best overall performance across CZI benchmarks, on zero-shot batch integration and linear probing cell-type annotation.
- LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo · Jun 17, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- RedTopic: Toward Topic-Diverse Red Teaming of Large Language Models
Jiale Ding, Xiang Zheng, Yutao Wu, Cong Wang, Wei-Bin Lee · Jun 17, 2025 · Citations: 0
Red Team
It tests LLMs with adversarial prompts to uncover vulnerabilities and improve safety alignment.
- Hope Speech Detection in code-mixed Roman Urdu tweets: A Positive Turn in Natural Language Processing
Muhammad Ahmad, Muhammad Waqas, Ameer Hamza, Ildar Batyrshin, Grigori Sidorov · Jun 17, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · Jun 17, 2025 · Citations: 0
Long Horizon
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
- Instruction Following by Principled Boosting Attention of Large Language Models
Vitoria Guardieiro, Avishree Khare, Adam Stein, Eric Wong · Jun 16, 2025 · Citations: 0
Yet in practice these constraints can be violated under long contexts or when user-provided context conflicts with them, creating reliability and safety risks.
- Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning
David Bani-Harouni, Chantal Pellegrini, Ege Özsoy, Nassir Navab, Matthias Keicher · Jun 16, 2025 · Citations: 0
Expert Verification
In contrast to this, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis via repeatedly requesting and interpreting relevant tests.
- DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation
Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang · Jun 16, 2025 · Citations: 0
Safety-aligned large language models (LLMs) remain vulnerable to backdoor attacks.
- Dynamic Reinsurance Treaty Bidding via Multi-Agent Reinforcement Learning
Stella C. Dong, James R. Finlay · Jun 16, 2025 · Citations: 0
Multi Agent
This paper develops a novel multi-agent reinforcement learning (MARL) framework for reinsurance treaty bidding, addressing long-standing inefficiencies in traditional broker-mediated placement processes.