- PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
Nina Hosseini-Kivanani · Feb 20, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning
Fangyuan Xu, Sihao Chen, Zinan Lin, Taiwei Shi, Sydney Graham · Feb 20, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer · Feb 20, 2026 · Citations: 0
Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever.
- Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026 · Citations: 0
We present Luna-2, a novel architecture that turns decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g., …).
- Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026 · Citations: 0
Pairwise Preference · Long Horizon
When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
- VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning
Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha · Feb 20, 2026 · Citations: 0
Existing cultural benchmarks are (i) manually crafted, (ii) limited to single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured.
- RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026 · Citations: 0
Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
- SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026 · Citations: 0
Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
- Subgroups of $U(d)$ Induce Natural RNN and Transformer Architectures
Joshua Nunley · Feb 20, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Validating Political Position Predictions of Arguments
Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026 · Citations: 0
Pairwise Preference
Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
- Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0
Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
- On the "Induction Bias" in Sequence Models
M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic · Feb 20, 2026 · Citations: 0
Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
- Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning
Tao Wu, Adam Kapelner · Feb 20, 2026 · Citations: 0
In summary, we demonstrate that a modern embedding model combined with a neural-network architecture, when guided by human supervision, yields a large, low-cost supply of near-perfect contexts for teaching vocabulary across a variety of target words.
- PsihoRo: Depression and Anxiety Romanian Text Corpus
Alexandra Ciobotaru, Ana-Maria Bucur, Liviu P. Dinu · Feb 20, 2026 · Citations: 0
Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health.
- VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Yutong Xin, Qiaochu Chen, Greg Durrett, Işıl Dillig · Feb 20, 2026 · Citations: 0
However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries.
- On the Semantic and Syntactic Information Encoded in Proto-Tokens for One-Step Text Reconstruction
Ivan Bondarenko, Egor Palkin, Fedor Tikunov · Feb 20, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · Feb 20, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026 · Citations: 0
When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
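  The snippet describes a three-stage escalation (deterministic scoring → LLM reasoning → a single HITL review). A minimal sketch of that pattern, assuming a score-margin trigger and hypothetical callables (`deterministic_score`, `llm_reason`, `human_review`) that are not the paper's actual API:

  ```python
  from dataclasses import dataclass
  from typing import Callable, Optional

  @dataclass
  class Resolution:
      label: str
      stage: str  # which stage resolved the annotation

  def resolve_annotation(annotation,
                         deterministic_score: Callable,
                         llm_reason: Callable,
                         human_review: Callable,
                         score_margin: float = 0.2) -> Resolution:
      """Escalation sketch: accept the deterministic answer only when the
      top-2 score margin is clear; otherwise ask the LLM; otherwise a human."""
      scores = deterministic_score(annotation)  # {candidate_feature: score}
      ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
      if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= score_margin:
          return Resolution(ranked[0][0], "deterministic")
      llm_choice: Optional[str] = llm_reason(annotation, [c for c, _ in ranked])
      if llm_choice is not None:
          return Resolution(llm_choice, "llm")
      return Resolution(human_review(annotation), "hitl")
  ```

  The margin rule and callable signatures are placeholders; the point is the single, ordered escalation path ending in one human review step.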
- Simplifying Outcomes of Language Model Component Analyses with ELIA
Aaron Louis Eidt, Nils Feldhus · Feb 20, 2026 · Citations: 0
Pairwise Preference
The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations.
- Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning
Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao Jin · Feb 20, 2026 · Citations: 0
Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead.
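  The abstract does not spell out the decoding rule, but generic confidence-gated contrastive decoding ("thinking by subtraction") can be sketched as below; the gate, the `alpha` weight, and the greedy argmax are illustrative assumptions, not the authors' exact formulation:

  ```python
  import math

  def softmax(logits):
      m = max(logits)
      exps = [math.exp(x - m) for x in logits]
      s = sum(exps)
      return [e / s for e in exps]

  def contrastive_next_token(expert_logits, amateur_logits,
                             alpha=0.5, conf_threshold=0.9):
      """One greedy decoding step. If the stronger model is already confident,
      keep its logits; otherwise subtract a scaled copy of the weaker model's
      logits to suppress tokens both models find generically likely."""
      confidence = max(softmax(expert_logits))
      if confidence >= conf_threshold:
          scores = expert_logits  # confident: no subtraction needed
      else:
          scores = [e - alpha * a
                    for e, a in zip(expert_logits, amateur_logits)]
      return scores.index(max(scores))
  ```

  With a low threshold the expert's top token survives; with a high threshold the subtraction can flip the choice toward tokens the weaker model disprefers.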
- Information-Theoretic Storage Cost in Sentence Comprehension
Kohei Kajikawa, Shinnosuke Isono, Ethan Gotlieb Wilcox · Feb 20, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Improving Sampling for Masked Diffusion Models via Information Gain
Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb · Feb 20, 2026 · Citations: 0
Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers for MDMs.
- Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models
Wojciech Michaluk, Tymoteusz Urban, Mateusz Kubita, Soveatin Kuntur, Anna Wroblewska · Feb 20, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Mirae Kim, Seonghun Jeong, Youngjun Kwak · Feb 20, 2026 · Citations: 0
Red Team
A baseline detector trained on FENCE achieves 99% in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models.
- The Statistical Signature of LLMs
Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli · Feb 20, 2026 · Citations: 0
We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. …).
- Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention
Siya Qi, Yudong Chen, Runcong Zhao, Qinglin Zhu, Zhanghao Hu · Feb 20, 2026 · Citations: 0
Experiments on the RAGTruth and HalluRAG benchmarks show that our approach achieves performance gains over verification-based, internal-representation-based, and attention-based methods across models and tasks.
- Agentic Adversarial QA for Improving Domain-Specific LLMs
Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki · Feb 20, 2026 · Citations: 0
Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
- Perceived Political Bias in LLMs Reduces Persuasive Abilities
Matthew DiGiuseppe, Joshua Robison · Feb 20, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026 · Citations: 0
Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
- CogGen: Cognitive-Load-Informed Fully Unsupervised Deep Generative Modeling for Compressively Sampled MRI Reconstruction
Qingyong Zhu, Yumin Tan, Xiang Gu, Dong Liang · Feb 20, 2026 · Citations: 0
- Towards More Standardized AI Evaluation: From Models to Agents
Ali El Filali, Inès Bedar · Feb 20, 2026 · Citations: 0
Evaluation is no longer a final checkpoint in the machine learning lifecycle.
- NIMMGen: Learning Neural-Integrated Mechanistic Digital Twins with LLMs
Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang · Feb 20, 2026 · Citations: 0
Recent work has explored LLM-based agentic frameworks to automatically construct mechanistic models from data; however, existing problem settings substantially oversimplify real-world conditions, leaving it unclear whether LLM-generated…
- Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Amine Kobeissi, Philippe Langlais · Feb 20, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026 · Citations: 0
Expert Verification
The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
- Analyzing LLM Instruction Optimization for Tabular Fact Verification
Xiaotang Du, Giwon Hong, Wai-Chung Kwan, Rohit Saxena, Ivan Titov · Feb 20, 2026 · Citations: 0
We study three optimizers from the DSPy framework -- COPRO, MiPROv2, and SIMBA -- across four benchmarks and three model families.
- Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering
Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman · Feb 20, 2026 · Citations: 0
Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to the given context.
- Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions
Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini · Feb 20, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Games That Teach, Chats That Convince: Comparing Interactive and Static Formats for Persuasive Learning
Seyed Hossein Alavi, Zining Wang, Shruthi Chockkalingam, Raymond T. Ng, Vered Shwartz · Feb 20, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.