- Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Hanna Yukhymenko, Anton Alexandrov, Martin Vechev · Feb 25, 2026 · Citations: 0
Automatic Metrics
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.
- SumTablets: A Transliteration Dataset of Sumerian Tablets
Cole Simmons, Richard Diehl Martinez, Dan Jurafsky · Feb 25, 2026 · Citations: 0
Automatic Metrics
Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script.
- Improving Parametric Knowledge Access in Reasoning Language Models
Melody Ma, John Hewitt · Feb 25, 2026 · Citations: 0
Automatic Metrics
We study reasoning for accessing world knowledge stored in a language model's parameters.
- GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
- LiCQA: A Lightweight Complex Question Answering System
Sourav Saha, Dwaipayan Roy, Mandar Mitra · Feb 25, 2026 · Citations: 0
Automatic Metrics
The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.
- DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs
Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen · Feb 25, 2026 · Citations: 0
Automatic Metrics
Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modes
- Dynamic Personality Adaptation in Large Language Models via State Machines
Leon Pielage, Ole Hätscher, Mitja Back, Bernhard Marschall, Benjamin Risse · Feb 25, 2026 · Citations: 0
Simulation Env
This work demonstrates the feasibility of modular, personality-adaptive architectures for education, customer support, and broader human-computer interaction.
- When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large Language Models (LLMs) are increasingly used to "professionalize" workplace communication, often at the cost of linguistic identity.
- NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image.
- IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages
Thanmay Jayakumar, Mohammed Safi Ur Rahman Khan, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan · Feb 25, 2026 · Citations: 0
Automatic Metrics
Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers.
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0
Automatic Metrics Tool Use
Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.
- Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek · Feb 25, 2026 · Citations: 0
Automatic Metrics
Theory of Mind (ToM) refers to an agent's ability to model the internal states of others.
- DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain
Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu · Feb 25, 2026 · Citations: 0
Automatic Metrics
We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United Sta
- TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang · Feb 25, 2026 · Citations: 0
Simulation Env
Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages.
- A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT
Louis Estève, Christophe Servan, Thomas Lavergne, Agata Savary · Feb 25, 2026 · Citations: 0
Automatic Metrics
Diversity has been gaining interest in the NLP community in recent years.
- CxMP: A Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models
Miyu Oba, Saku Sugawara · Feb 25, 2026 · Citations: 0
Automatic Metrics
Most existing benchmarks focus on judging grammatical acceptability, whereas the ability to interpret meanings conveyed by grammatical forms has received much less attention.
- RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning
Bo Xue, Yuan Jin, Luoyi Fu, Jiaxin Ding, Xinbing Wang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Across four benchmarks, RADAR achieves 5-6% relative gains on link prediction and triple classification over strong LLM baselines, while increasing task-relevant mutual information in intermediate representations by 62.9%, indicating more r
- MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
- Large Language Models are Algorithmically Blind
Sohan Venkatesh, Ashish Mahendran Kurapath, Tejas Melkote · Feb 25, 2026 · Citations: 0
Automatic Metrics
Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing and the marginal above-random performance of the best m
- Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text
Bitan Majumder, Anirban Sen · Feb 25, 2026 · Citations: 0
Automatic Metrics Simulation Env
Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and low-resource linguistic availability.
- ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection
Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged
- DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.
- Personalized Graph-Empowered Large Language Model for Proactive Information Access
Chia Cheng Chang, An-Zi Yen, Hen-Hsen Huang, Hsin-Hsi Chen · Feb 25, 2026 · Citations: 0
Automatic Metrics
Since individuals may struggle to recall all life details and often confuse events, establishing a system to assist users in recalling forgotten experiences is essential.
- Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026 · Citations: 0
Human Eval Automatic Metrics
Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84).
- FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0
Demonstrations Automatic Metrics
In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
- Scalable Kernel-Based Distances for Statistical Inference and Integration
Masha Naslidnyk · Feb 25, 2026 · Citations: 0
Simulation Env
Representing, comparing, and measuring the distance between probability distributions is a key task in computational statistics and machine learning.
- xai-cola: A Python library for sparsifying counterfactual explanations
Lin Zhu, Lei You · Feb 25, 2026 · Citations: 0
Automatic Metrics
Counterfactual explanation (CE) is an important domain within post-hoc explainability.
- JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning
Ruichen Xu, Ying-Jun Angela Zhang, Jianwei Huang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Extensive evaluations on MNIST and CIFAR-10 demonstrate that JSAM achieves up to 15% improvement in test accuracy compared to existing unbiased selection mechanisms while maintaining cost efficiency across varying data heterogeneity levels.
- DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion
Marcel Lamott, Saifullah Saifullah, Nauman Riaz, Yves-Noel Weweler, Tobias Alt-Veit · Feb 25, 2026 · Citations: 0
Automatic Metrics
We evaluate across eleven benchmarks spanning key information extraction, question answering, document classification, and document layout analysis.
- Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
Heejin Jo · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference.
- D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption.
- fEDM+: A Risk-Based Fuzzy Ethical Decision Making Framework with Principle-Level Explainability and Pluralistic Validation
Abeer Dyoub, Francesca A. Lisi · Feb 25, 2026 · Citations: 0
Automatic Metrics
In a previous work, we introduced the fuzzy Ethical Decision-Making framework (fEDM), a risk-based ethical reasoning architecture grounded in fuzzy logic.
- The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems
Hyo Jin Kim · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.
- Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
MD. Sagor Chowdhury, Adiba Fairooz Chowdhury · Feb 25, 2026 · Citations: 0
Automatic Metrics
We describe our end-to-end system for Bengali long-form automatic speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle.
- Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen · Feb 25, 2026 · Citations: 0
Demonstrations Automatic Metrics
Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but even closed-source LLMs.
- Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson · Feb 25, 2026 · Citations: 0
Automatic Metrics
Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular.
- Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration: A Hybrid Knowledge-Data-Driven Approach
Xu Yang, Chenhui Lin, Xiang Ma, Dong Liu, Ran Zheng · Feb 25, 2026 · Citations: 0
Automatic Metrics
Considering the operational scenarios and requirements in real-world ADNs, in this paper, we propose a hybrid knowledge-data-driven approach that leverages dynamic collaboration between a large language model (LLM) agent and a reinforcement
- SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics
Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
- Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems.
- Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
- DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation
Duc Trung Vu, Pham Khanh Chi, Dat Phi Van, Linh Ngo Van, Sang Dinh · Feb 25, 2026 · Citations: 0
Automatic Metrics
Extensive experiments across diverse NLP benchmarks demonstrate that DWA-KD outperforms state-of-the-art KD baselines, while ablation studies confirm the complementary contributions of entropy-based token weighting and embedding and final h
- Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis
Shaoxuan Wu, Jingkun Chen, Chong Ma, Cong Shen, Xiao Zhang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists.
- CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.
- Sparsity Induction for Accurate Post-Training Pruning of Large Language Models
Minhao Jiang, Zhikai Li, Xuewen Liu, Jing Zhang, Mengjuan Chen · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency.
- PPCR-IM: A System for Multi-layer DAG-based Public Policy Consequence Reasoning and Social Indicator Mapping
Zichen Song, Weijia Li · Feb 25, 2026 · Citations: 0
Automatic Metrics
For each policy episode, the system outputs a structured record containing the DAG, indicator mappings, and three evaluation measures: an expected-indicator coverage score, a discovery rate for overlooked but relevant indicators, and a rela
- Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration
Tangsang Chongbang, Pranesh Pyara Shrestha, Amrit Sarki, Anku Jaiswal · Feb 25, 2026 · Citations: 0
Automatic Metrics
We first establish highly proficient ASR and NMT components: a Wav2Vec2-XLS-R-300m model achieved a state-of-the-art 2.72% CER on OpenSLR-54, and a multi-stage fine-tuned MarianMT model reached a 28.32 BLEU score on the FLORES-200 benchmark
- Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
Yexing Du, Youcheng Pan, Zekun Wang, Zheng Chu, Yichong Huang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results.
- Multi-dimensional Assessment and Explainable Feedback for Counselor Responses to Client Resistance in Text-based Counseling with LLMs
Anqi Li, Ruihan Wang, Zhaoming Chen, Yuqian Chen, Yu Lu · Feb 25, 2026 · Citations: 0
Automatic Metrics
Although current NLP research has examined overall counseling quality and general therapeutic skills, it fails to provide granular evaluations of high-stakes moments where clients exhibit resistance.
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
- RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li · Feb 25, 2026 · Citations: 0
Rubric Rating Automatic Metrics
Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
- When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning
Muku Akasaka, Soyeon Caren Han · Feb 25, 2026 · Citations: 0
Automatic Metrics
In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks.
- Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
Xinzhe Luo, Shuai Shao, Yan Wang, Jiangtao Wang, Yutong Bai · Feb 25, 2026 · Citations: 0
Automatic Metrics
To address these challenges, we construct the ICT-MRI dataset - the first public biopsy-verified benchmark with 249 cases across four categories.
- Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
- MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification
Kazi Samin Yasar Alam, Md Tanbir Chowdhury, Tamim Ahmed, Ajwad Abrar, Md Rafid Haque · Feb 25, 2026 · Citations: 0
Human Eval Automatic Metrics
We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting.
- Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access
Touseef Hasan, Laila Cure, Souvika Sarkar · Feb 25, 2026 · Citations: 0
Simulation Env
We conduct a pilot evaluation study using community-sourced queries to examine system behavior in realistic scenarios.
- Exploring Human-Machine Coexistence in Symmetrical Reality
Zhenliang Zhang · Feb 25, 2026 · Citations: 0
Automatic Metrics
In the context of the evolution of artificial intelligence (AI), the interaction between humans and AI entities has become increasingly salient, challenging the conventional human-centric paradigms of human-machine interaction.
- Power and Limitations of Aggregation in Compound AI Systems
Nivasini Ananthakrishnan, Meena Jagadeesan · Feb 25, 2026 · Citations: 0
Automatic Metrics
In this work, we investigate the power and limitations of aggregation within a stylized principal-agent framework.
- Revisiting RAG Retrievers: An Information Theoretic Benchmark
Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert · Feb 25, 2026 · Citations: 0
Automatic Metrics
Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves.
- From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators
Zhihao Li, Yu Feng, Zhilu Lai, Wei Wang · Feb 25, 2026 · Citations: 0
Automatic Metrics
On standard PDE benchmarks and real datasets, our method attains accuracy competitive with the state of the art while providing intrinsic interpretability.