- Exploring Effective Strategies for Building a User-Configured GPT for Coding Classroom Dialogues
Luwei Bai, Dongkeun Han, Sara Hennessy · Jun 8, 2025 · Citations: 0
- Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test
Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan · Jun 8, 2025 · Citations: 0
Red Team
To reduce costs or maliciously alter model behaviors, API providers may discreetly serve quantized or fine-tuned variants, which can degrade performance and compromise safety.
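The auditing idea lends itself to a small sketch: under the null hypothesis that the API serves the reference model, the rank of each observed output's score among scores of reference-model samples is uniformly distributed, and deviation can be flagged with a standard chi-square check. The scoring setup, sample sizes, and test statistic below are illustrative assumptions, not the paper's exact protocol:

```python
from collections import Counter

def rank_uniformity_stat(observed_scores, reference_scores_per_query):
    """Chi-square statistic testing whether observed ranks are uniform.

    For each query, rank the score of the audited API's output among K
    scores from reference-model samples. If the API really serves the
    reference model, each rank in 0..K is equally likely, so the rank
    counts should be flat and the statistic small.
    """
    K = len(reference_scores_per_query[0])
    ranks = [sum(r < obs for r in refs)
             for obs, refs in zip(observed_scores, reference_scores_per_query)]
    expected = len(ranks) / (K + 1)
    counts = Counter(ranks)
    return sum((counts.get(r, 0) - expected) ** 2 / expected
               for r in range(K + 1))

# Deterministic toy example: perfectly uniform ranks give statistic 0;
# ranks piled onto a single value give a large statistic.
refs = [[0.1, 0.3, 0.5, 0.7, 0.9]] * 6
print(rank_uniformity_stat([0.05, 0.2, 0.4, 0.6, 0.8, 0.95], refs))  # 0.0
print(rank_uniformity_stat([0.95] * 6, refs))                        # 30.0
```

Comparing the statistic against a chi-square threshold with K degrees of freedom would then turn this into an accept/reject audit decision.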
- A dependently-typed calculus of event telicity and culminativity
Pavel Kovalev, Carlo Angiuli · Jun 8, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization
Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon · Jun 8, 2025 · Citations: 0
Pairwise Preference
To showcase the value of our approach, we apply it to learning short-horizon question-answering policies of a fixed length, where the agent reasons about potential answers or asks clarifying questions.
- BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning
Ha-Thanh Nguyen, Hideyuki Tachibana, Chaoran Liu, Qianying Liu, Su Myat Noe · Jun 8, 2025 · Citations: 0
We benchmark a representative suite of cutting-edge models, including OpenAI GPT-5 variants, GPT-4o, Qwen, and prominent Japanese LLMs, under a uniform, zero-shot protocol.
- Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri · Jun 7, 2025 · Citations: 0
- You Only Fine-tune Once: Many-Shot In-Context Fine-Tuning for Large Language Models
Wenchong He, Liqian Peng, Zhe Jiang, Alex Go · Jun 6, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
Jingyu Xiao, Man Ho Lam, Ming Wang, Yuxuan Wan, Junliang Liu · Jun 6, 2025 · Citations: 0
However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development becomes predominant in modern front-end programming, current benchmarks fail to incorporate mainstream…
- Can Theoretical Physics Research Benefit from Language Agents?
Sirui Lu, Zhijing Jin, Terry Jingchen Zhang, Pavel Kos, J. Ignacio Cirac · Jun 6, 2025 · Citations: 0
Physics demands approximation judgment, symmetry exploitation, and physical grounding that require AI agents specifically trained on physics reasoning patterns and equipped with physics-aware verification tools.
- Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models
Yingqi Hu, Zhuo Zhang, Jingyuan Zhang, Jinghua Wang, Qifan Wang · Jun 6, 2025 · Citations: 0
These findings highlight concrete privacy risks in FedLLMs and establish a benchmark and evaluation framework for future research on privacy-preserving federated learning.
- Elementary Math Word Problem Generation using Large Language Models
Nimesh Ariyarathne, Harshani Bandara, Yasith Heshan, Omega Gamage, Surangika Ranathunga · Jun 6, 2025 · Citations: 0
Unlike the existing LLM-based solutions for MWP generation, we carried out an extensive set of experiments involving different LLMs, prompting strategies, techniques to improve the diversity of MWPs, as well as techniques that employ human…
- Comparative Analysis of Modern Machine Learning Models for Retail Sales Forecasting
Luka Hobor, Mario Brcic, Lidija Polutnik, Ante Kapetanovic · Jun 6, 2025 · Citations: 0
- Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models
Cheonbok Park, Jeonghoon Kim, Joosung Lee, Sanghwan Bae, Jaegul Choo · Jun 6, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Generalized Incremental Learning under Concept Drift across Evolving Data Streams
En Yu, Jie Lu, Guangquan Zhang · Jun 6, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation
Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong · Jun 6, 2025 · Citations: 0
To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning.
- Voice Impression Control in Zero-Shot TTS
Kenichi Fujita, Shota Horiguchi, Yusuke Ijima · Jun 6, 2025 · Citations: 0
The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control.
- FictionalQA: A Dataset for Studying Memorization and Knowledge Acquisition
John Kirchenbauer, Janny Mongkolsupawan, Yuxin Wen, Tom Goldstein, Daphne Ippolito · Jun 5, 2025 · Citations: 0
- MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark
Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han · Jun 5, 2025 · Citations: 0
Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited.
- Toward Data Systems That Are Business Semantic Centric and AI Agents Assisted
Cecil Pang · Jun 5, 2025 · Citations: 0
- Search Arena: Analyzing Search-Augmented LLMs
Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan · Jun 5, 2025 · Citations: 0
Pairwise Preference
In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs.
- Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang · Jun 5, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun · Jun 5, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Sensory-Motor Control with Large Language Models via Iterative Policy Refinement
Jônata Tyska Carvalho, Stefano Nolfi · Jun 5, 2025 · Citations: 0
We propose a method that enables large language models (LLMs) to control embodied agents through the generation of control policies that directly map continuous observation vectors to continuous action vectors.
- EHR2Path: Scalable Modeling of Longitudinal Patient Pathways from Multimodal Electronic Health Records
Chantal Pellegrini, Ege Özsoy, David Bani-Harouni, Matthias Keicher, Nassir Navab · Jun 5, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms
Nurul Aisyah, Muhammad Dehan Al Kautsar, Arif Hidayat, Raqib Chowdhury, Fajri Koto · Jun 5, 2025 · Citations: 0
Rubric Rating
Assessment tasks include grading and generating personalized Indonesian feedback guided by rubric-based evaluation.
- MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
Dingdong Wang, Junan Li, Jincenzi Wu, Dongchao Yang, Xueyuan Chen · Jun 5, 2025 · Citations: 0
- Enhancing Delta Compression in LLMs via SVD-based Quantization Error Minimization
Boya Xiong, Shuo Wang, Weifeng Ge, Guanhua Chen, Yun Chen · Jun 5, 2025 · Citations: 0
Extensive experiments confirm PrinMix performs well: for 7B LLMs, PrinMix outperforms SOTA Delta-CoMe on challenging benchmarks by 22.3% on AIME2024 and 6.1% on GQA.
- LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Using in-the-wild Data
Wen Ding, Fan Qian · Jun 5, 2025 · Citations: 0
Across Mandarin ASR and Spanish-to-English AST evaluations, LESS delivers consistent gains, with an absolute Word Error Rate reduction of 3.8% on WenetSpeech, and BLEU score increase of 0.8 and 0.7, achieving 34.0 on Callhome and 64.7 on…
- "Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation
Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche · Jun 4, 2025 · Citations: 0
Web Browsing
Recent advancements in large language models (LLMs) have spurred interest in robotic navigation that incorporates complex spatial, mathematical, and conditional constraints from natural language into the planning problem.
- Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Apurv Verma, NhatHai Phan, Shubhendu Trivedi · Jun 4, 2025 · Citations: 0
In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection.
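The mitigation reads like best-of-n reranking over watermarked samples; a minimal sketch, where the generator and alignment scorer are hypothetical stand-ins rather than the paper's components:

```python
def best_of_n(generate_watermarked, score_alignment, prompt, n=4):
    """Sample n watermarked candidates and keep the highest-scoring one.

    Every candidate is produced by watermarked decoding, so detection is
    unaffected; reranking by an alignment score recovers quality lost to
    the watermark.
    """
    candidates = [generate_watermarked(prompt) for _ in range(n)]
    return max(candidates, key=score_alignment)

# Toy stand-ins: a "generator" cycling through canned replies and a
# "scorer" that simply prefers longer answers.
replies = iter(["unsafe reply", "short", "a longer, more helpful reply"])
best = best_of_n(lambda p: next(replies), len, "hello", n=3)
print(best)  # the longest candidate wins
```

With n as small as 2–4, this reranking step is cheap relative to generation itself.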
- Learning to Diagnose Privately: DP-Powered LLMs for Radiology Report Classification
Payel Bhattacharjee, Fengwei Tian, Geoffrey D. Rubin, Joseph Y. Lo, Nirav Merchant · Jun 4, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Toward Beginner-Friendly LLMs for Language Learning: Controlling Difficulty in Conversation
Meiqing Jin, Liam Dugan, Chris Callison-Burch · Jun 4, 2025 · Citations: 0
We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments.
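As described, TMR is the fraction of tokens in an utterance judged incomprehensible; a minimal sketch, where a known-vocabulary set stands in for the comprehensibility judgment (an assumption — the paper's procedure may differ):

```python
def token_miss_rate(tokens, known_vocab):
    """Fraction of tokens outside the learner's known vocabulary.

    A proxy for Token Miss Rate (TMR): any token not in the set is
    counted as incomprehensible to the learner.
    """
    if not tokens:
        return 0.0
    misses = sum(t.lower() not in known_vocab for t in tokens)
    return misses / len(tokens)

known = {"the", "cat", "sat", "on", "a", "mat"}
print(token_miss_rate("The cat perched on a windowsill".split(), known))
# 2 of 6 tokens unknown -> 0.333...
```

A difficulty controller could then constrain generation so that TMR stays below a learner-appropriate threshold.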
- High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning
Tim Franzmeyer, Archie Sravankumar, Lijuan Liu, Yuning Mao, Rui Hou · Jun 4, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang · Jun 4, 2025 · Citations: 0
Expert Verification
However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences…
- EuroGEST: Investigating gender stereotypes in multilingual language models
Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch · Jun 4, 2025 · Citations: 0
Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric.
- AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance
Dhaval Patel, Shuxin Lin, James Rayfield, Nianjun Zhou, Chathurangi Shyalika · Jun 4, 2025 · Citations: 0
In this paper, we introduce AssetOpsBench, a unified framework for orchestrating and evaluating domain-specific agents for Industry 4.0.
- CyclicReflex: Improving Reasoning Models via Cyclical Reflection Token Scheduling
Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, Sijia Liu · Jun 4, 2025 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Go-Browse: Training Web Agents with Structured Exploration
Apurva Gandhi, Graham Neubig · Jun 4, 2025 · Citations: 0
Web Browsing
To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments.
- Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing
Shigeng Chen, Linhao Luo, Zhangchi Qiu, Yanan Cao, Carl Yang · Jun 4, 2025 · Citations: 0
Despite their effectiveness on general-domain benchmarks, their applicability to the complex medical domain remains largely unexplored.
- ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking
Xianming Li, Aamir Shakir, Rui Huang, Julius Lipp, Benjamin Clavié · Jun 4, 2025 · Citations: 0
Notably, our 0.5B ProRank even surpasses powerful LLM reranking models on the BEIR benchmark, establishing that properly trained SLMs can achieve superior document reranking performance while maintaining computational efficiency.
- OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu · Jun 3, 2025 · Citations: 0
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu · Jun 3, 2025 · Citations: 0
Critique Edit
We show that plateaued RL models can successfully refine failed solutions when given natural language critiques.
- DiaBlo: Diagonal Blocks Are Sufficient For Finetuning
Selcuk Gurses, Aozhong Zhang, Yanxia Deng, Xun Dong, Xin Li · Jun 3, 2025 · Citations: 0
Through extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, we show that fine-tuning only diagonal blocks is sufficient for strong and consistent…
- PhysGaia: A Physics-Aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis
Mijeong Kim, Gunhee Kim, Jungyoon Choi, Wonjae Roh, Bohyung Han · Jun 3, 2025 · Citations: 0
- Machine Learning for Enhancing Deliberation in Online Political Discussions and Participatory Processes: A Survey
Maike Behrendt, Stefan Sylvius Wagner, Carina Weinmann, Marike Bormann, Mira Warne · Jun 3, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Automated Web Application Testing: End-to-End Test Case Generation with Large Language Models and Screen Transition Graphs
Nguyen-Khang Le, Quan Minh Bui, Minh Ngoc Nguyen, Hiep Nguyen, Trung Vo · Jun 3, 2025 · Citations: 0
Web Browsing
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage
Kalyan Nakka, Nitesh Saxena · Jun 3, 2025 · Citations: 0
Red Team
The inherent risk of Large Language Models (LLMs) generating harmful and unsafe content has highlighted the need for their safety alignment.
- Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh · Jun 2, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation
Wei Shen, Zhang Yaxiang, Minhui Huang, Mengfan Xu, Jiawei Zhang · Jun 2, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering
Shuai Wang, Yinan Yu · Jun 2, 2025 · Citations: 0
Long Horizon
Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.
- Synthesis of discrete-continuous quantum circuits with multimodal diffusion models
Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil · Jun 2, 2025 · Citations: 0
We benchmark the model over different experiments, analyzing the method's accuracy across varying qubit counts and circuit depths, showcasing the ability of the method to outperform existing approaches in gate counts and under noisy conditions.