- MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
Taha Koleilat, Hojat Asgariandehkordi, Omid Nejati Manzari, Berardino Barile, Yiming Xiao · Feb 23, 2026
Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts.
- Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems
Mukul Chhabra, Luigi Medrano, Arush Verma · Feb 23, 2026
Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints and structured identifiers (e.g., error codes).
- How communicatively optimal are exact numeral systems? Once more on lexicon size and morphosyntactic complexity
Chundra Cathcart, Arne Rubehn, Katja Bocklage, Luca Ciucci, Kellen Parker van Dam · Feb 23, 2026
Recent research argues that exact recursive numeral systems optimize communicative efficiency by balancing a tradeoff between the size of the numeral lexicon and the average morphosyntactic complexity (roughly, length in morphemes) of numerals.
- Natural Language Processing Models for Robust Document Categorization
Radoslaw Roszczyk, Pawel Tecza, Maciej Stodolski, Krzysztof Siwek · Feb 23, 2026
This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution.
- No One Size Fits All: QueryBandits for Hallucination Mitigation
Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso · Feb 23, 2026
Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing.
- An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram · Feb 23, 2026
Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text and standardizing them to Human Phenotype Ontology (HPO) terms.
- What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance
William Watson, Nicole Cho, Sumitra Ganesh, Manuela Veloso · Feb 23, 2026
We operationalize this insight by constructing a 22-dimension query feature vector covering clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding, all known to affect human comprehension.
- InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation
Yu Li, Pranav Narayanan Venkit, Yada Pruksachatkun, Chien-Sheng Wu · Feb 23, 2026
Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said.
- A Very Big Video Reasoning Suite
Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer · Feb 23, 2026
We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities.
- KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026
Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific generation.
- AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization
Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng · Feb 23, 2026
The paradigm of automated program generation is shifting from one-shot generation to inference-time search, where Large Language Models (LLMs) function as semantic mutation operators within evolutionary loops.
- To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering
Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen · Feb 23, 2026
Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA.
- NanoKnow: How to Know What Your Language Model Knows
Lingwei Gu, Nour Jedidi, Jimmy Lin · Feb 23, 2026
Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pretraining data.
- BabyLM Turns 4 and Goes Multilingual: Call for Papers for the 2026 BabyLM Workshop
Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Jaap Jumelet, Tal Linzen · Feb 23, 2026
For the workshop, we call for papers related to the overall theme of BabyLM, which includes training efficiency, small-scale training datasets, cognitive modeling, model evaluation, and architecture innovation.
- How Retrieved Context Shapes Internal Representations in RAG
Samuel Yeh, Sharon Li · Feb 23, 2026
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial.
- Multilingual Large Language Models do not comprehend all natural languages to equal degrees
Natalia Moskvina, Raquel Montero, Masaya Yoshida, Ferdy Hubers, Paolo Morosi · Feb 23, 2026
Large Language Models (LLMs) play a critical role in how humans access information.
- Structured Prompt Language: Declarative Context Management for LLMs
Wen G. Gong · Feb 23, 2026
SPL-flow extends SPL into resilient agentic pipelines with a three-tier provider fallback strategy (Ollama -> OpenRouter -> self-healing retry) fully transparent to the .spl script.
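The three-tier fallback described above can be sketched as a simple escalation loop. This is a minimal illustration only: the provider names, the `call(prompt)` interface, and the retry shape are assumptions for exposition, not the SPL-flow API.

```python
def run_with_fallback(prompt, providers, max_retries=2):
    """Try each provider tier in order; if the whole chain fails,
    retry it (the 'self-healing' third tier), up to max_retries."""
    for _attempt in range(max_retries + 1):
        for call in providers:
            try:
                return call(prompt)
            except RuntimeError:
                continue  # escalate to the next provider tier
    raise RuntimeError("all providers failed after retries")

# Illustrative stand-ins for a local-first chain (e.g. Ollama -> OpenRouter):
def flaky(prompt):
    raise RuntimeError("local model unavailable")

def reliable(prompt):
    return f"answer:{prompt}"

print(run_with_fallback("2+2?", [flaky, reliable]))  # -> answer:2+2?
```

The point of keeping the fallback in the runtime, as SPL-flow does, is that the `.spl` script never needs to know which tier actually answered.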
- Entropy in Large Language Models
Marco Scharringhausen · Feb 23, 2026
In this study, the output of large language models (LLMs) is considered an information source generating an unlimited sequence of symbols drawn from a finite alphabet.
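Treating model output as a symbol source invites the standard Shannon measure. A minimal sketch of estimating empirical entropy from an observed sequence (plain frequency counts, not the paper's actual estimator):

```python
import math
from collections import Counter

def empirical_entropy(symbols):
    """Shannon entropy in bits per symbol of an observed sequence
    over a finite alphabet, using plug-in frequency estimates."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A uniform 4-symbol source attains the maximum log2(4) = 2 bits/symbol:
print(empirical_entropy("abcd" * 100))  # -> 2.0
```

Lower values indicate a more predictable (more compressible) output stream.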
- Position: General Alignment Has Hit a Ceiling; Edge Alignment Must Be Taken Seriously
Han Bao, Yue Huang, Xiaoda Wang, Zheyuan Zhang, Yujun Zhou · Feb 23, 2026
We take the position that the dominant paradigm of General Alignment, which compresses diverse human values into a single scalar reward, reaches a structural ceiling in settings with conflicting values, plural stakeholders, and irreducible disagreement.
- AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization
Fahmida Liza Piya, Rahmatollah Beheshti · Feb 23, 2026
We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and targeted correction to reduce hallucinated content.
- gencat: Generative computerized adaptive testing
Wanyong Feng, Andrew Lan · Feb 23, 2026
We train the model in a two-step process, first via Supervised Fine-Tuning and then via preference optimization for knowledge-response alignment.
- QUIETT: Query-Independent Table Transformation for Robust Reasoning
Gaurav Najpande, Tampu Ravi Kumar, Manan Roy Choudhury, Neha Valeti, Yanjie Fu · Feb 23, 2026
Experiments on four benchmarks (WikiTQ, HiTab, NQ-Table, and SequentialQA) show consistent gains across models and reasoning paradigms, with particularly strong improvements on a challenge set of structurally diverse, unseen questions.
- Exploring Anti-Aging Literature via ConvexTopics and Large Language Models
Lana E. Yeganova, Won G. Kim, Shubo Tian, Natalie Xie, Donald C. Comeau · Feb 23, 2026
Common clustering and topic modeling approaches such as K-means or LDA remain sensitive to initialization and prone to local optima, limiting reproducibility and evaluation.
- Cross-lingual Matryoshka Representation Learning across Speech and Text
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina · Feb 23, 2026
We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best.
- Contextual Safety Reasoning and Grounding for Open-World Robots
Zachary Ravichandran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar · Feb 23, 2026
Traditional safety approaches enforce fixed constraints in user-specified contexts, limiting their ability to handle the open-ended contextual variability of real-world deployment.
- ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting
Yuxing Tian, Fengran Mo, Weixu Zhang, Yiyan Qi, Jian-Yun Nie · Feb 23, 2026
The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for zero-shot re-ranking tasks.
- Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval
Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao · Feb 23, 2026
We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation.
- Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore · Feb 23, 2026
Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue.
- When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
Krzysztof Adamkiewicz, Brian Moser, Stanislav Frolov, Tobias Christian Nauen, Federico Raue · Feb 23, 2026
Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following.
- Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling
Xiang Li, Zikai Wei, Yiyan Qi, Wanyun Zhou, Xiang Liu · Feb 23, 2026
Financial market movements are often driven by discrete financial events conveyed through news, whose impacts are heterogeneous, abrupt, and difficult to capture under purely numerical prediction objectives.
- DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang · Feb 23, 2026
Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR.
- Denotational Semantics for ODRL: Knowledge-Based Constraint Conflict Detection
Daham Mustafa, Diego Collarana, Yixin Peng, Rafiqul Haque, Christoph Lange-Bever · Feb 23, 2026
We validate it with 154 benchmarks across six knowledge base families (GeoNames, ISO 3166, W3C DPV, a GDPR-derived taxonomy, BCP 47, and ISO 639-3) and four structural KBs targeting adversarial edge cases.
- Axis Decomposition for ODRL: Resolving Dimensional Ambiguity in Policy Constraints through Interval Semantics
Daham Mustafa, Diego Collarana, Yixin Peng, Rafiqul Haque, Christoph Lange-Bever · Feb 23, 2026
For these operands, a single scalar constraint admits one interpretation per axis, making policy evaluation non-deterministic.
- SHIELD: Semantic Heterogeneity Integrated Embedding for Latent Discovery in Clinical Trial Safety Signals
Francois Vandenhende, Anna Georgiou, Theodoros Psaras, Ellie Karekla · Feb 23, 2026
We present SHIELD, a novel methodology for automated and integrated safety signal detection in clinical trials.
- SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation
Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026
This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations.
- MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
Wall Kim, Chaeyoung Song, Hanul Kim · Feb 23, 2026
Recently, TabPFN has gained attention as a foundation model for tabular data.
- Keyboards for the Endangered Idu Mishmi Language
Akhilesh Kakolu Ramarao · Feb 23, 2026
We present a mobile and desktop keyboard suite for Idu Mishmi, an endangered Trans-Himalayan language spoken by approximately 11,000 people in Arunachal Pradesh, India.
- NILE: Formalizing Natural-Language Descriptions of Formal Languages
Tristan Kneisel, Marko Schmellenkamp, Fabian Vehlken, Thomas Zeume · Feb 23, 2026
This is motivated by educational scenarios where learners describe a formal language (presented, e.g., by a finite state automaton, regular expression, pushdown automaton, context-free grammar, or in set notation) in natural language.
- Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li · Feb 23, 2026
The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
- KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge
Alex Robertson, Huizhi Liang, Mahbub Gani, Rohit Kumar, Srijith Rajamohan · Feb 23, 2026
Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations.
- Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding
Roberto Tacconelli · Feb 23, 2026
An out-of-distribution (OOD) evaluation on a document published after the model's training cutoff confirms these gains are not memorization artifacts, achieving 0.723 bpb on unseen text.
- Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning
Borisiuk Anna, Andrey Savchenko, Alexander Panchenko, Elena Tutubalina · Feb 23, 2026
In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores.
- Eye-Tracking-while-Reading: A Living Survey of Datasets with Open Library Support
Deborah N. Jakobi, David R. Reich, Paul Prasse, Jana M. Hofmann, Lena S. Bolliger · Feb 23, 2026
Eye-tracking-while-reading corpora are a valuable resource for many different disciplines and use cases.
- DEEP: Docker-based Execution and Evaluation Platform
Sergio Gómez González, Miguel Domingo, Francisco Casacuberta · Feb 23, 2026
Comparative evaluation of several systems is a recurring task in research.
- Temporal-Aware Heterogeneous Graph Reasoning with Multi-View Fusion for Temporal Question Answering
Wuzhenghong Wen, Bowen Zhou, Jinwen Huang, Xianjie Wu, Yuwei Sun · Feb 23, 2026
Experiments on multiple TKGQA benchmarks demonstrate consistent improvements over multiple baselines.
- Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo · Feb 23, 2026
Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually rich documents, is central to current multimodal retrieval applications.
- Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining
Jeffrey Li, Josh Gardner, Doug Kang, Fangping Shi, Karanjeet Singh · Feb 23, 2026
This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance.
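The union idea is straightforward: a page dropped by one HTML-to-text extractor may be recovered by another, so combining their outputs raises token yield. A hedged sketch with stand-in extractors (the real DCLM pipeline and its extractors are not shown here):

```python
def union_extract(pages, extractors):
    """Keep each page's text from the first extractor that recovers it,
    so the result is the union of the documents any extractor yields."""
    out = {}
    for url, html in pages.items():
        for extract in extractors:
            text = extract(html)
            if text:
                out[url] = text
                break  # this page is covered; move on
    return out

# Stand-in extractors: one only accepts markup with <p> tags, one keeps raw text.
strict = lambda html: html if "<p>" in html else ""
lenient = lambda html: html.strip()

pages = {"a": "<p>body</p>", "b": "plain text"}
print(len(union_extract(pages, [strict, lenient])))  # -> 2
```

Alone, `strict` would keep only one of the two pages; the union keeps both, which is the mechanism behind the reported yield increase.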
- Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation
Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong · Feb 23, 2026
Additionally, we present HyperDocRED, a rigorously annotated benchmark for document-level knowledge hypergraph extraction.
- How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1
Yinuo Xu, Shuo Lu, Jianjie Cheng, Meng Wang, Qianlong Xie · Feb 23, 2026
Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation.
- Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song · Feb 23, 2026
We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.
- Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026
In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
- Can Large Language Models Replace Human Coders? Introducing ContentBench
Michael Haman · Feb 23, 2026
This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.
- PuppetChat: Fostering Intimate Communication through Bidirectional Actions and Micronarratives
Emma Jiren Wang, Siying Hu, Zhicong Lu · Feb 23, 2026
As a primary channel for sustaining modern intimate relationships, instant messaging facilitates frequent connection across distances.
- SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning
Zelin He, Boran Han, Xiyuan Zhang, Shuai Zhang, Haotian Lin · Feb 23, 2026
As collecting data for knowledge injection fine-tuning is costly, we further leverage a reinforcement learning-based approach with verifiable rewards (RLVR) to elicit knowledge-rich traces without human supervision, then transfer such an internalized capability.
- OptiRepair: Closed-Loop Diagnosis and Repair of Supply Chain Optimization Models with LLM Agents
Ruicheng Ao, David Simchi-Levi, Xinshang Wang · Feb 23, 2026
Whether AI agents can perform this task remains untested.
- Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins
Jasmin Han, Janardan Devkota, Joseph Waring, Amanda Luken, Felix Naughton · Feb 23, 2026
Perceived message effectiveness (PME) by potential intervention end-users is important for selecting and optimizing personalized smoking cessation intervention messages for mobile health (mHealth) platform delivery.