- Eigenmood Space: Uncertainty-Aware Spectral Graph Analysis of Psychological Patterns in Classical Persian Poetry
Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar · Feb 18, 2026
The resulting framework supports scalable, auditable digital-humanities analysis while preserving interpretive caution by propagating uncertainty from verse-level evidence to poet-level inference.
- When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English
Hasan Can Biyik, Libby Barak, Jing Peng, Anna Feldman · Feb 18, 2026
Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages.
- ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz · Feb 18, 2026
We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap.
- MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani · Feb 18, 2026
MALLVI presents a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation.
- SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation
Kushal Kedia, Tyler Ga Wei Lum, Jeannette Bohg, C. Karen Liu · Feb 18, 2026
The ability to manipulate tools significantly expands the set of tasks a robot can perform.
- Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect
Minh Duc Bui, Manuel Mager, Peter Herbert Kann, Katharina von der Wense · Feb 18, 2026
We introduce a digital dictionary, an NLP-ready dataset derived from an existing resource (Schramm, 1966), to support researchers in modeling and benchmarking the language.
- BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization
Ahmed Rafid, Rumman Adib, Fariya Ahmed, Ajwad Abrar, Mohammed Saidul Islam · Feb 18, 2026
However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries.
- Training Large Reasoning Models Efficiently via Progressive Thought Encoding
Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu · Feb 18, 2026
Experiments on three models (Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B) across six widely used, challenging mathematical benchmarks show consistent gains: our method achieves a +19.3% improvement over LoR
- Claim Automation using Large Language Model
Zhengda Mo, Zhiyu Quan, Eli O'Donohue, Kaiwen Zhong · Feb 18, 2026
We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy.
- IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages
Priyaranjan Pattnayak, Sanchari Chowdhuri · Feb 18, 2026
Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied.
- Hybrid-Gym: Training Coding Agents to Generalize Across Tasks
Yiqing Xie, Emmy Liu, Gaokai Zhang, Nachiket Kotalwar, Shubham Gandhi · Feb 18, 2026
When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE-Bench.
- One-step Language Modeling via Continuous Denoising
Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang · Feb 18, 2026
Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models.
- Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
Charalampos Mastrokostas, Nikolaos Giarelis, Nikos Karacapilidis · Feb 18, 2026
In this study, we address this research gap in Greek QA by contributing (i) DemosQA, a novel dataset constructed from social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist.
- References Improve LLM Alignment in Non-Verifiable Domains
Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty · Feb 18, 2026
First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs.
- Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency
Victoria Lin, Xinnuo Xu, Rachel Lawrence, Risa Ueno, Amit Sharma · Feb 18, 2026
Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability.
- Omitted Variable Bias in Language Models Under Distribution Shift
Victoria Lin, Louis-Philippe Morency, Eli Ben-Michael · Feb 18, 2026
Importantly, we identify that the resulting omitted variable bias from unobserved variables can compromise both evaluation and optimization in language models.
- Reinforced Fast Weights with Next-Sequence Prediction
Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky · Feb 18, 2026
Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length.
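The constant-memory property can be illustrated with a linear-attention-style fast-weight sketch (a toy illustration under my own assumptions, not this paper's architecture): the memory is a fixed-size matrix updated with key-value outer products, so its footprint never depends on how many tokens have been seen.

```python
import numpy as np

def fast_weight_pass(keys, values, queries):
    """Toy fast-weight memory: the state is a fixed (d_v x d_k) matrix
    updated with outer products, so memory is constant in sequence length."""
    d_k, d_v = keys.shape[1], values.shape[1]
    W = np.zeros((d_v, d_k))           # fast-weight state, O(d_v * d_k)
    outputs = []
    for k, v, q in zip(keys, values, queries):
        W += np.outer(v, k)            # write: associate key with value
        outputs.append(W @ q)          # read: query the associative memory
    return np.array(outputs)

T, d = 8, 4                            # hypothetical sequence length and width
rng = np.random.default_rng(0)
K, V, Q = (rng.normal(size=(T, d)) for _ in range(3))
out = fast_weight_pass(K, V, Q)
print(out.shape)                       # (8, 4); state size never grows with T
```

In contrast, standard attention caches all T key-value pairs, so its memory grows linearly with context length.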
- Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Wenxuan Ding, Nicholas Tomlin, Greg Durrett · Feb 18, 2026
Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent.
- Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held · Feb 18, 2026
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling.
- Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026
The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
- Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Subrit Dikshit · Feb 18, 2026
The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter counts.
- AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models
Adib Sakhawat, Fardeen Sadab · Feb 18, 2026
We introduce the Adversarial Resource Extraction Game (AREG), a benchmark that operationalizes persuasion and resistance as a multi-turn, zero-sum negotiation over financial resources.
- Who can we trust? LLM-as-a-jury for Comparative Assessment
Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements.
- ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models
Antoine Chaffin, Luca Arnaboldi, Amélie Chatelain, Florent Krzakala · Feb 18, 2026
Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models.
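For context, multi-vector models such as ColBERT score query-document pairs with late interaction (MaxSim): each query token takes its best match over the document's token embeddings. A minimal sketch, with random unit vectors standing in for real token embeddings:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) relevance: each query token vector takes
    its max similarity over all document token vectors; sum over query."""
    sims = query_vecs @ doc_vecs.T        # (n_query_tokens, n_doc_tokens)
    return sims.max(axis=1).sum()

def unit(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = unit(rng.normal(size=(4, 8)))                      # 4 query "tokens"
d_rel = np.vstack([q, unit(rng.normal(size=(6, 8)))])  # doc containing them
d_other = unit(rng.normal(size=(10, 8)))               # unrelated doc
print(maxsim_score(q, d_rel))     # 4.0: every query token finds an exact match
print(maxsim_score(q, d_rel) > maxsim_score(q, d_other))  # True
```

The KD step the abstract mentions typically distills scores from a strong single-vector teacher into this multi-vector scoring function.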
- Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
Melkamu Abay Mersha, Jugal Kalita · Feb 18, 2026
Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret.
- CitiLink-Summ: Summarization of Discussion Subjects in European Portuguese Municipal Meeting Minutes
Miguel Marques, Ana Luísa Fernandes, Ana Filipa Pacheco, Rute Rebouças, Inês Cantante · Feb 18, 2026
A major bottleneck is the scarcity of datasets containing high-quality, manually crafted summaries, which limits the development and evaluation of effective summarization models for this domain.
- Creating a digital poet
Vered Tohar, Tsahi Hayat, Amir Leshem · Feb 18, 2026
In a blinded authorship test with 50 humanities students and graduates (three AI poems and three poems by well-known poets each), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence intervals that include chance.
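The "at chance" reading of the 54% and 52% figures can be sanity-checked with a normal-approximation binomial interval. A sketch, assuming judgments aggregate to roughly n = 150 per condition (50 judges x 3 poems; the exact n is not stated in this snippet):

```python
import math

def normal_ci(p_hat, n, z=1.96):
    """95% normal-approximation confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Assumed n: 50 judges x 3 poems per condition = 150 judgments.
for label, p in [("human poems", 0.54), ("AI poems", 0.52)]:
    lo, hi = normal_ci(p, 150)
    print(f"{label}: [{lo:.3f}, {hi:.3f}]  contains 0.5: {lo < 0.5 < hi}")
```

Under this assumption both intervals straddle 0.5, consistent with the claim that judges could not distinguish the two sources.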
- Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset
Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, Jinsook Lee, Doug Pietrzak · Feb 18, 2026
To address this challenge, we investigate the "numeric ambiguity" problem and introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, created through a human-in-the-loop LLM workflow.
- Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification
Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić · Feb 18, 2026
Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data.
- Optimizing Soft Prompt Tuning via Structural Evolution
Zhenzhen Huang, Chaoning Zhang, Haoyu Bian, Songbo Zhang, Chi-lok Andy Tai · Feb 18, 2026
Soft prompt tuning leverages continuous embeddings to capture task-specific information in large pre-trained language models (LLMs), achieving competitive performance in few-shot settings.
- From Growing to Looping: A Unified View of Iterative Computation in LLMs
Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer · Feb 18, 2026
Looping (reusing a block of layers across depth) and depth growing (training shallow-to-deep models by duplicating middle layers) have both been linked to stronger reasoning, but their relationship remains unclear.
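The connection between the two can be made concrete in a toy sketch (my own illustration, not the paper's setup): when depth growing duplicates a block's weights exactly, the grown forward pass coincides with looping that block.

```python
import numpy as np

def mlp_block(x, W1, W2):
    """One toy residual MLP block standing in for a transformer layer."""
    return x + np.tanh(x @ W1) @ W2

def looped_forward(x, params, loops):
    """Looping: reuse one block's weights `loops` times across depth."""
    for _ in range(loops):
        x = mlp_block(x, *params)
    return x

def grown_forward(x, param_list):
    """Depth growing: a stack of distinct (here, duplicated) blocks."""
    for params in param_list:
        x = mlp_block(x, *params)
    return x

rng = np.random.default_rng(0)
d = 4
params = (rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=(d, d)))
x = rng.normal(size=(1, d))

# Duplicating the middle block's weights makes growing reduce to looping:
assert np.allclose(looped_forward(x, params, 3), grown_forward(x, [params] * 3))
```

The interesting regime, presumably the paper's subject, is what happens after the duplicated weights are trained further and drift apart.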
- Learning to Learn from Language Feedback with Social Meta-Learning
Jonathan Cook, Diego Antognini, Martin Klissarov, Claudiu Musat, Edward Grefenstette · Feb 18, 2026
They are rarely proactive in soliciting this feedback, even when faced with ambiguity, which can make their dialogues feel static, one-sided, and lacking the adaptive qualities of human conversation.
- Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026
Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
- Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning
Jenny Kunz · Feb 18, 2026
Machine-translated data is widely used in multilingual NLP, particularly when native text is scarce.
- IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
Saurabh Bharti, Gaurav Azad, Abhinaw Jagtap, Nachiket Tapas · Feb 18, 2026
The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity.
- TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers
Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif · Feb 18, 2026
Agentic systems (AI architectures that autonomously execute multi-step workflows to achieve complex goals) are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, and gating.
- Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents
Mohammad H. A. Monfared, Lucie Flek, Akbar Karimi · Feb 18, 2026
We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high-quality synthetic training examples.
- Variable-Length Semantic IDs for Recommender Systems
Kirill Khrylchenko · Feb 18, 2026
In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions.
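The "frequent concepts receive shorter descriptions" principle is the same pressure behind classical variable-length codes. A Huffman sketch over hypothetical apparel-item symbols (the symbols and counts are illustrative, not from the paper):

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build Huffman codes: frequent symbols get shorter codewords."""
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)                       # tie-breaker for heap ordering
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = Counter({"shirt": 50, "jacket": 20, "kimono": 5, "waders": 1})
codes = huffman_codes(freqs)
print(codes)   # "shirt" gets a 1-bit code, rare "waders" a 3-bit code
```

A variable-length semantic-ID scheme applies the same idea to item representations: popular items get short IDs, so the expected sequence length the recommender must model shrinks.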
- AI-Driven Structure Refinement of X-ray Diffraction
Bin Cao, Qian Zhang, Zhenjie Feng, Taolue Zhang, Jiaqiang Huang · Feb 18, 2026
We benchmark WPEM on standard reference patterns (PbSO$_4$ and Tb$_2$BaCoO$_5$), where it yields lower $R_p/R_{wp}$ than widely used packages (FullProf and TOPAS) under matched refinement conditions.
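For readers unfamiliar with Rietveld refinement, the standard profile residuals quoted here compare observed and calculated diffraction intensities point by point:

```latex
R_p = \frac{\sum_i \left| y_i^{\mathrm{obs}} - y_i^{\mathrm{calc}} \right|}{\sum_i y_i^{\mathrm{obs}}},
\qquad
R_{wp} = \sqrt{\frac{\sum_i w_i \left( y_i^{\mathrm{obs}} - y_i^{\mathrm{calc}} \right)^2}{\sum_i w_i \left( y_i^{\mathrm{obs}} \right)^2}}
```

Lower values indicate a better fit of the refined structural model to the measured pattern, which is why the comparison against FullProf and TOPAS is reported in these terms.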
- Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026
LLM-based agents execute real-world workflows via tools and memory.
- MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026
Existing evaluations of agents with memory typically assess memorization and action in isolation.
- PREFER: An Ontology for the PREcision FERmentation Community
Txell Amigó, Shawn Zheng Kai Tan, Angel Luu Phanthanourak, Sebastian Schulz, Pasquale D. Colaianni · Feb 18, 2026
Precision fermentation relies on microbial cell factories to produce sustainable food, pharmaceuticals, chemicals, and biofuels.
- MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models
Martin Hyben, Sebastian Kula, Jan Cegin, Jakub Simko, Ivan Srba · Feb 18, 2026
We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles.
- Aladdin-FTI @ AMIYA Three Wishes for Arabic NLP: Fidelity, Diglossia, and Multidialectal Generation
Jonathan Mutal, Perla Al Almaoui, Simon Hengchen, Pierrette Bouillon · Feb 18, 2026
Arabic dialects have long been under-represented in Natural Language Processing (NLP) research due to their non-standardization and high variability, which pose challenges for computational modeling.
- Lyapunov Spectral Analysis of Speech Embedding Trajectories in Psychosis
Jelena Vasic, Branislav Andjelic, Ana Mancic, Dusica Filipovic Djurdjevic, Ljiljana Mihic · Feb 18, 2026
We analyze speech embeddings from structured clinical interviews of psychotic patients and healthy controls by treating language production as a high-dimensional dynamical process.
- Are LLMs Ready to Replace Bangla Annotators?
Md. Najib Hasan, Touseef Hasan, Souvika Sarkar · Feb 18, 2026
In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences.
- Long-Tail Knowledge in Large Language Models: Taxonomy, Mechanisms, Interventions and Implications
Sanket Badhe, Deep Shah, Nehal Kathrotia · Feb 18, 2026
We further examine how existing evaluation practices obscure tail behavior and complicate accountability for rare but consequential failures.
- The Validity of Coreference-based Evaluations of Natural Language Understanding
Ian Porada · Feb 18, 2026
In this thesis, I refine our understanding of what conclusions we can reach from coreference-based evaluations by expanding existing evaluation practices and considering the extent to which evaluation results are either converging or conflicting.
- ModalImmune: Immunity Driven Unlearning via Self Destructive Training
Rong Fu, Jia Yee Tan, Wenxin Zhang, Zijian Zhang, Ziming Wang · Feb 18, 2026
Empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.
- Beyond Learning: A Training-Free Alternative to Model Adaptation
Namkyung Yoon, Kyeonghyun Yoo, Wooyong Jung, Sanghong Kim, Hwangnam Kim · Feb 18, 2026
Despite the continuous research and evolution of language models, they sometimes underperform previous versions.
- Learning Personalized Agents from Human Feedback
Kaiqu Liang, Julia Kruk, Shengyi Qian, Xianjun Yang, Shengjie Bi · Feb 18, 2026
Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users.
- Discrete Stochastic Localization for Non-autoregressive Generation
Yunshu Wu, Jiayi Cheng, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis · Feb 18, 2026
On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with ~4× fewer denoiser evaluations, and matches autoregressive quality at high budgets.
- LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers
Peiqi Sui · Feb 18, 2026
We formalize this tension by quantifying the "uncertainty gap" between human-authored stories and model-generated continuations.
- Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection
Rong Fu, Ziming Wang, Shuo Yin, Wenxin Zhang, Haiyun Wei · Feb 18, 2026
Emotional expression underpins natural communication and effective human-computer interaction.
- Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution
Nithin Sivakumaran, Shoubin Yu, Hyunji Lee, Yue Zhang, Ali Payani · Feb 18, 2026
On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness, including hint attribution and early answering area over the curve (AOC).
- Missing-by-Design: Certifiable Modality Deletion for Revocable Multimodal Sentiment Analysis
Rong Fu, Wenxin Zhang, Ziming Wang, Chunlei Meng, Jiaxuan Lu · Feb 18, 2026
Experiments on benchmark datasets show that MBD achieves strong predictive performance under incomplete inputs and delivers a practical privacy-utility trade-off, positioning surgical unlearning as an efficient alternative to full retraining.