- BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
Jiageng Wu, Bowen Gu, Ren Zhou, Kevin Xie, Doug Snyder · Apr 28, 2025 · Citations: 0
However, benchmarking on large-scale real-world data such as electronic health records (EHRs) is critical, as clinical decisions are directly informed by these sources, yet current evaluations remain limited.
- A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim · Apr 28, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
Zuhong Lin, Daoyuan Ren, Kai Ran, Jing Sun, Songlin Yu · Apr 26, 2025 · Citations: 0
Multi Agent
We present MOFh6, a large language model driven system that reads raw articles or crystal codes and converts them into standardized synthesis tables.
- Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0
Pairwise Preference Multi Agent
These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
- Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation
Peiyuan Jing, Kinhei Lee, Zhenxuan Zhang, Huichi Zhou, Zhengqing Yuan · Apr 25, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review
Toghrul Abbasli, Kentaroh Toyoda, Yuan Wang, Leon Witt, Muhammad Asif Ali · Apr 25, 2025 · Citations: 0
While some of these methods have been adapted for LLMs, the literature lacks an in-depth analysis of their effectiveness and does not offer a comprehensive benchmark to enable insightful comparison among existing solutions.
- How much does context affect the accuracy of AI health advice?
Prashant Garg, Thiemo Fetzer · Apr 25, 2025 · Citations: 0
English-language performance does not reliably generalise across contexts, underscoring the need for multilingual, domain-specific evaluation before deployment in public-health communication.
- Unsupervised Corpus Poisoning Attacks in Continuous Space for Dense Retrieval
Yongkang Li, Panagiotis Eustratiadis, Simon Lupart, Evangelos Kanoulas · Apr 24, 2025 · Citations: 0
- Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Thomas F Burns, Letitia Parcalabescu, Stephan Wäldchen, Michael Barlow, Gregor Ziegltrum · Apr 24, 2025 · Citations: 0
A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone.
- RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning
Tongrui Su, Qingbin Li, Shengyu Zhu, Wei Chen, Xueqi Cheng · Apr 24, 2025 · Citations: 0
- FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu · Apr 24, 2025 · Citations: 0
We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data.
- Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang · Apr 24, 2025 · Citations: 0
Multi Agent
Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into operational code repositories.
- ReCellTy: Domain-Specific Knowledge Graph Retrieval-Augmented LLMs Reasoning Workflow for Single-Cell Annotation
Dezheng Han, Yibin Jia, Ruxiao Chen, Wenjie Han, Shuaishuai Guo · Apr 24, 2025 · Citations: 0
Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across multiple tissue types, while more closely aligning with the cognitive logic of manual annotation.
- GeneMamba: An Efficient and Effective Foundation Model on Single Cell Data
Cong Qi, Hanzhang Fang, Siqi Jiang, Xun Song, Tianxing Hu · Apr 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees
David Smith Sundarsingh, Jun Wang, Jyotirmoy V. Deshmukh, Yiannis Kantaros · Apr 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu · Apr 22, 2025 · Citations: 0
Furthermore, we propose "connection thinking" to bridge the semantic gap between rigid formal logic and fluent human-like reasoning, ensuring coherent logical transitions.
- DIP: Efficient Large Multimodal Model Training with Dynamic Interleaved Pipeline
Zhenliang Xue, Hanpeng Hu, Xing Chen, Yimin Jiang, Yixin Song · Apr 19, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Leakage and Interpretability in Concept-Based Models
Enrico Parisini, Tapabrata Chakraborti, Chris Harbron, Ben D. MacArthur, Christopher R. S. Banerji · Apr 18, 2025 · Citations: 0
- Scalable Multi-Task Learning through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents
Rachmad Vidya Wicaksana Putra, Avaneesh Devkota, Muhammad Shafique · Apr 18, 2025 · Citations: 0
- Cost-of-Pass: An Economic Framework for Evaluating Language Models
Mehmet Hamza Erol, Batu El, Mirac Suzgun, Mert Yuksekgonul, James Zou · Apr 17, 2025 · Citations: 0
We then define the frontier cost-of-pass: the minimum cost-of-pass achievable across available models or the human-expert(s), using the approx.
- Evaluating the Diversity and Quality of LLM Generated Content
Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin · Apr 16, 2025 · Citations: 0
Pairwise Preference
Recent work suggests that preference-tuning techniques -- such as Reinforcement Learning from Human Feedback (RLHF) methods like PPO and GRPO, as well as alternatives like DPO -- reduce diversity, creating a dilemma given that these models…
- Diffusion Generative Recommendation with Continuous Tokens
Haohao Qu, Shanru Lin, Yujuan Ding, Yiqi Wang, Wenqi Fan · Apr 16, 2025 · Citations: 0
Pairwise Preference
Specifically, ContRec consists of two key modules: a sigma-VAE Tokenizer, which encodes users/items with continuous tokens; and a Dispersive Diffusion module, which captures implicit user preference.
- arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation
Weiqi Wang, Jiefu Ou, Yangqiu Song, Benjamin Van Durme, Daniel Khashabi · Apr 14, 2025 · Citations: 0
Pairwise Preference
Building on recent work (Newman et al., 2024), we move beyond oracle settings by (i) simulating well-specified yet schema-agnostic user demands that avoid leaking gold column names or values, (ii) explicitly modeling retrieval noise via…
- TWSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models
Aish Albladi, Md Kaosar Uddin, Minarul Islam, Cheryl Seals · Apr 14, 2025 · Citations: 0
- Offline Dynamic Inventory and Pricing Strategy: Addressing Censored and Dependent Demand
Korel Gundem, Zhengling Qi · Apr 14, 2025 · Citations: 0
- AgentA/B: Automated and Scalable Web A/BTesting with Interactive LLM Agents
Yuxuan Lu, Ting-Yao Hsu, Hansu Gu, Limeng Cui, Yaochen Xie · Apr 13, 2025 · Citations: 0
Long Horizon
Yet, traditional A/B testing remains constrained by its dependence on the large-scale and live traffic of human participants, and the long time of waiting for the testing result.
- Dominated Actions in Imperfect-Information Games
Sam Ganzfried · Apr 13, 2025 · Citations: 0
- Adaptive Insurance Reserving with CVaR-Constrained Reinforcement Learning under Macroeconomic Regimes
Stella C. Dong · Apr 13, 2025 · Citations: 0
A Proximal Policy Optimization (PPO) agent is trained using a risk-sensitive reward that penalizes reserve shortfall, capital inefficiency, and breaches of a volatility-adjusted solvency floor, with tail risk explicitly controlled through…
- Linguistic Comparison of AI- and Human-Written Responses to Online Mental Health Queries
Koustuv Saha, Yoshee Jain, Violeta J. Rodriguez, Munmun De Choudhury · Apr 12, 2025 · Citations: 0
Although genAI shows promise in delivering immediate and personalized responses, its effectiveness in replicating the nuanced, experience-based support of human peers remains an open question.
- BioChemInsight: An Online Platform for Automated Extraction of Chemical Structures and Activity Data from Patents
Zhe Wang, Fangtian Fu, Wei Zhang, Lige Yan, Nan Li · Apr 12, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems
Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu · Apr 12, 2025 · Citations: 0
Tool Use
In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which…
- MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers
Arash Ahmadi, Sarah Sharif, Yaser M. Banad · Apr 11, 2025 · Citations: 0
- Generating Fine Details of Entity Interactions
Xinyi Gu, Jiayuan Mao · Apr 11, 2025 · Citations: 0
Critique Edit
However, images should also encapsulate rich interactions between objects, where existing models often fall short, likely due to limited training data and benchmarks for rare interactions.
- Automating quantum feature map design via large language models
Kenya Sakka, Kosuke Mitarai, Keisuke Fujii · Apr 10, 2025 · Citations: 0
- CAReDiO: Cultural Alignment via Representativeness and Distinctiveness Guided Data Optimization
Jing Yao, Xiaoyuan Yi, Jindong Wang, Zhicheng Dou, Xing Xie · Apr 9, 2025 · Citations: 0
Extensive experiments on 15 cultures demonstrate that CAReDiO can create high-quality data with richer cultural information and enable efficient alignment of small open-source or large proprietary LLMs with as few as 200 training samples,…
- Estimating Item Difficulty Using Large Language Models and Tree-Based Machine Learning Algorithms
Pooya Razavi, Sonya Powers · Apr 9, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Can LLMs Simulate Personas with Reversed Performance? A Systematic Investigation for Counterfactual Instruction Following in Math Reasoning Context
Sai Adith Senthil Kumar, Hao Yan, Saipavan Perepa, Murong Yue, Ziyu Yao · Apr 8, 2025 · Citations: 0
- Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning
Yuehan Qin, Shawn Li, Yi Nian, Xinyan Velocity Yu, Yue Zhao · Apr 8, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- BiasCause: Evaluate Socially Biased Causal Reasoning of Large Language Models
Tian Xie, Tongxin Yin, Vaishakh Keshava, Xueru Zhang, Siddhartha Reddy Jonnalagadda · Apr 8, 2025 · Citations: 0
- SkillFlow: Scalable and Efficient Agent Skill Retrieval System
Fangzhou Li, Pagkratios Tagkopoulos, Ilias Tagkopoulos · Apr 8, 2025 · Citations: 0
We present SkillFlow, the first multi-stage retrieval pipeline designed for agent skill discovery, framing skill acquisition as an information retrieval problem over a corpus of ~36K community-contributed SKILL.md definitions indexed from…
- NativQA Framework: Enabling LLMs and VLMs with Native, Local, and Everyday Knowledge
Firoj Alam, Md Arid Hasan, Sahinur Rahman Laskar, Mucahid Kutlu, Kareem Darwish · Apr 8, 2025 · Citations: 0
Web Browsing
The developed resources can be used for LLMs benchmarking and further fine-tuning.
- Pretraining Language Models for Diachronic Linguistic Change Discovery
Elisabeth Fittschen, Sabrina Li, Tom Lippincott, Leshem Choshen, Craig Messner · Apr 7, 2025 · Citations: 0
This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies.
- SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
Justus Westerhoff, Erblina Purelku, Jakob Hackstein, Jonas Loos, Leo Pinetzki · Apr 7, 2025 · Citations: 0
- Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao · Apr 7, 2025 · Citations: 0
Red Team
We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies…
- Causal Retrieval with Semantic Consideration
Hyunseo Shin, Wonseok Hwang · Apr 7, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Splits! Flexible Sociocultural Linguistic Investigation at Scale
Eylon Caplan, Tania Chakraborty, Dan Goldwasser · Apr 6, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Structured Legal Document Generation in India: A Model-Agnostic Wrapper Approach with VidhikDastaavej
Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya · Apr 4, 2025 · Citations: 0
Comprehensive evaluation, including lexical, semantic, LLM-based, and expert-driven assessments with inter-annotator agreement, shows that the wrapper substantially improves factual accuracy, coherence, and completeness compared to…
- Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Julian Minder, Clément Dumas, Caden Juang, Bilal Chugtai, Neel Nanda · Apr 3, 2025 · Citations: 0
Pairwise Preference
Using the BatchTopK crosscoder, we successfully identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as false information and personal question, along with multiple…
- Cultural Biases of Large Language Models and Humans in Historical Interpretation
Fabio Celli, Georgios Spathulas · Apr 3, 2025 · Citations: 0
This paper compares historical annotations by humans and Large Language Models.
- AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs
Xiang Feng, Wentao Jiang, Zengmao Wang, Yong Luo, Pingbo Xu · Apr 3, 2025 · Citations: 0
- One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Ezzeldin Shereen, Dan Ristea, Shae McFadden, Burak Hasircioglu, Vasilios Mavroudis · Apr 2, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Chain of Correction for Full-text Speech Recognition with Large Language Models
Zhiyuan Tang, Dong Wang, Zhikai Zhou, Yong Liu, Shen Huang · Apr 2, 2025 · Citations: 0
- Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning
Philipp Mondorf, Shijia Zhou, Monica Riedler, Barbara Plank · Apr 2, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou · Apr 1, 2025 · Citations: 0
Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our…
- Benchmarking NLP-supported Language Sample Analysis for Swiss Children's Speech
Anja Ryser, Yingqiang Gao, Sarah Ebling · Apr 1, 2025 · Citations: 0
This preliminary study aims to identify optimal practices that support speech-language pathologists in diagnosing DLD more efficiently with active involvement of human specialists.
- Impact of Data Duplication on Deep Neural Network-Based Image Classifiers: Robust vs. Standard Models
Alireza Aghabagherloo, Aydin Abadi, Sumanta Sarkar, Vishnu Asutosh Dasu, Bart Preneel · Apr 1, 2025 · Citations: 0
- Efficient Construction of Model Family through Progressive Training Using Model Expansion
Kazuki Yano, Sho Takase, Sosuke Kobayashi, Shun Kiyono, Jun Suzuki · Apr 1, 2025 · Citations: 0