- Fine-tuning Whisper for Pashto ASR: strategies and scale
Hanif Rahman · Apr 7, 2026 · Citations: 0
Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.
- MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long · Apr 7, 2026 · Citations: 0
As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge.
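The dual scoring setup described in this entry (a reference-based metric alongside LLM-as-a-judge) can be sketched roughly as follows; the unigram-F1 metric and the judge-prompt wording are illustrative assumptions, not taken from the paper:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, a simple reference-based metric."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def judge_prompt(conclusion: str, abstract_body: str) -> str:
    """Builds a hypothetical LLM-as-a-judge prompt (wording is illustrative)."""
    return (
        "Rate 1-5 how well the conclusion follows from the abstract.\n"
        f"Abstract: {abstract_body}\nConclusion: {conclusion}\nScore:"
    )

score = rouge1_f1("treatment improved outcomes",
                  "the treatment improved patient outcomes")
```

In practice the judge prompt would be sent to an LLM and its numeric score averaged with the reference-based metric per example.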
- Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning
Philipp Hellwig, Willem Zuidema, Claire E. Stevenson, Martha Lewis · Apr 7, 2026 · Citations: 0
Analogical reasoning is a hallmark of human intelligence, enabling us to solve new problems by transferring knowledge from one situation to another.
- Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR
Thibault Bañeras-Roux, Sergio Burdisso, Esaú Villatoro-Tello, Dairazalia Sánchez-Cortés, Shiran Liu · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs
Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka · Apr 7, 2026 · Citations: 0
Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized.
- DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena · Apr 7, 2026 · Citations: 0
Long Horizon
Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis.
- Multi-objective Evolutionary Merging Enables Efficient Reasoning Models
Mario Iacobelli, Adrian Robert Minut, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli · Apr 7, 2026 · Citations: 0
Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the…
- Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection
Afroza Nowshin, Prithweeraj Acharjee Porag, Haziq Jeelani, Fayeq Jeelani Syed · Apr 7, 2026 · Citations: 0
Through a combination of automatic evaluation and qualitative analysis, we observe an apparent accuracy-fidelity trade-off: high-resource baselines such as NLLB (No Language Left Behind) achieve higher aggregate BLEU scores (13.75) by…
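The aggregate BLEU scores this entry reports can be reproduced conceptually with a minimal sentence-level BLEU (no smoothing, standard brevity penalty); this is a generic sketch of the metric, not the paper's evaluation code, which likely uses a library implementation such as sacrebleu:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-gram tuples in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """BLEU with clipped n-gram precisions and brevity penalty.

    Unsmoothed: returns 0.0 if any n-gram order has no matches.
    """
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(c_ngrams.values())
        if total == 0:
            return 0.0
        clipped = sum((c_ngrams & r_ngrams).values())  # clip counts by reference
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: 1 if candidate is longer than reference, else exp(1 - r/c).
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Corpus-level BLEU aggregates clipped counts over all sentence pairs before taking precisions, which is why single-sentence scores can differ noticeably from the corpus score.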
- Learning to Interrupt in Language-based Multi-agent Communication
Danqing Wang, Da Yin, Ruta Desai, Lei Li, Asli Celikyilmaz · Apr 7, 2026 · Citations: 0
Multi Agent
Motivated by this, we propose an interruptible communication framework that allows a listening agent to interrupt the current speaker.

- The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
Yi Xu, Philipp Jettkant, Laura Ruis · Apr 7, 2026 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Team Fusion@SU @ BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking
Georgi Grazhdanski, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't
Jonathan Nemitz, Carsten Eickhoff, Junyi Jessy Li, Kyle Mahowald, Michal Golovanevsky · Apr 7, 2026 · Citations: 0
To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules.
- State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation
Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi · Apr 7, 2026 · Citations: 0
Demonstrations
Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA performance, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by…
- Attention Flows: Tracing LLM Conceptual Engagement via Story Summaries
Rebecca M. M. Hicke, Sil Hamilton, David Mimno, Ross Deans Kristensen-McLachlan · Apr 7, 2026 · Citations: 0
When human authors of summaries compress a story, they reveal what they consider narratively important.
- Say Something Else: Rethinking Contextual Privacy as Information Sufficiency
Yunze Xiao, Wenkai Li, Xiaoyuan Wu, Ningshan Ma, Yueqi Song · Apr 7, 2026 · Citations: 0
LLM agents increasingly draft messages on behalf of users, yet users routinely overshare sensitive information and disagree on what counts as private.
- FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts
Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ART: Attention Replacement Technique to Improve Factuality in LLMs
Ziqin Luo, Yihao Quan, Xiaofeng Zhang, Xiaosong Yuan, Chen Shen · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning
Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao · Apr 7, 2026 · Citations: 0
Expert Verification
These models achieve sufficiently high accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results on the interactive Pedagogy Benchmark Leaderboard, significantly surpassing…
- The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models
Michael Rizvi-Martel, Guillaume Rabusseau, Marius Mosbach · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Severity-Based Curriculum Learning Strategy for Arabic Medical Text Generation
Ahmed Alansary, Molham Mohamed, Ali Hamdi · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads
Charlotte Pouw, Hosein Mohebbi, Afra Alishahi, Willem Zuidema · Apr 7, 2026 · Citations: 0
Demonstrations
In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain.
- Severity-Aware Weighted Loss for Arabic Medical Text Generation
Ahmed Alansary, Molham Mohamed, Ali Hamdi · Apr 7, 2026 · Citations: 0
Experiments are conducted using the MAQA dataset, which provides Arabic medical complaints and trusted human responses.
- STDec: Spatio-Temporal Stability Guided Decoding for dLLMs
Yuzhe Chen, Jiale Cao, Xuyang Liu, Jin Xie, Aiping Yang · Apr 7, 2026 · Citations: 0
Across textual reasoning and multimodal understanding benchmarks, STDec substantially improves throughput while maintaining comparable task performance.
- Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework
Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal · Apr 7, 2026 · Citations: 0
Multi Agent
Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools.
- In-Place Test-Time Training
Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang · Apr 7, 2026 · Citations: 0
Pairwise Preference
Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.
- Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement
Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Exclusive Unlearning
Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma · Apr 7, 2026 · Citations: 0
Red Team
We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to…
- ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
Wang Yang, Chaoda Song, Xinpeng Li, Debargha Ganguly, Chuang Ma · Apr 7, 2026 · Citations: 0
Existing agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable.
- JUÁ -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections
Jayr Pereira, Leandro Fernandes, Erick de Brito, Roberto Lotufo, Luiz Bonifacio · Apr 7, 2026 · Citations: 0
We present JUÁ, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections.
- Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang · Apr 7, 2026 · Citations: 0
Multi Agent
Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision.
- LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces
Olexander Mazurets, Olexander Barmak, Leonid Bedratyuk, Iurii Krak · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles
Ben Wigler, Maria Tsfasman, Tiffany Matej Hrkalovic · Apr 7, 2026 · Citations: 0
Personality traits are richly encoded in natural language, and large language models (LLMs) trained on human text can simulate personality when conditioned on persona descriptions.
- Short Data, Long Context: Distilling Positional Knowledge in Transformers
Patrick Huber, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Igor Fedorov · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection
Hongxu Zhou · Apr 7, 2026 · Citations: 0
Critique Edit
While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy.
- A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan · Apr 7, 2026 · Citations: 0
Expert Verification
Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
- BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection
Zhongxing Zhang, Emily K. Vraga, Jisu Huh, Jaideep Srivastava · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
Michael Cuccarese · Apr 7, 2026 · Citations: 0
Demonstrations
This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization.
- Disentangling MLP Neuron Weights in Vocabulary Space
Asaf Avrahamy, Yoav Gur-Arieh, Mor Geva · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models
Xiaojie Gu, Ziying Huang, Weicong Hong, Jian Xie, Renze Lou · Apr 7, 2026 · Citations: 0
However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate…
- Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design
Shuqing Zhao · Apr 7, 2026 · Citations: 0
We present case studies of an 8-way set-associative L1 data cache and a synthesizable PG021-compatible AXI DMA controller (with Yosys and OpenSTA results on Sky130), and compare Arch to SystemVerilog, VHDL, Chisel, Bluespec, and other…
- Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family
Oscar Chew, Hsiao-Ying Huang, Kunal Jain, Tai-I Chen, Khoa D Doan · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures
Fan Zhang, Mingzi Song, Rania Elbadry, Yankai Chen, Shaobo Wang · Apr 7, 2026 · Citations: 0
We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting.
- Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration
Yi Yuan, Xuhong Wang, Shanzhe Lei · Apr 7, 2026 · Citations: 0
As agent-based systems continue to evolve, deep research agents are capable of automatically generating research-style reports across diverse domains.
- BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs
Abbas Ghaddar, Ivan Kobyzev, Boxing Chen, Yufei Cui · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- "I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?
Naen Xu, Jiayi Sheng, Changjiang Li, Chunyi Zhou, Yuyuan Li · Apr 7, 2026 · Citations: 0
Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks.
- The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model
Hongxu Zhou · Apr 7, 2026 · Citations: 0
Existing benchmarks probe either monotonic state tracking, as in the standard Flip-Flop task, or structural nesting, as in the Dyck languages, but neither isolates reversible semantic state retrieval.
- FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed · Apr 7, 2026 · Citations: 0
Rubric Rating Long Horizon
To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete.
- FRENCH-YMCA: A FRENCH Corpus meeting the language needs of Youth, froM Children to Adolescents
Cherifa Ben Khelil, Jean-Yves Antoine, Anaïs Halftermeyer, Frédéric Rayar, Mathieu Thebaud · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Mechanistic Circuit-Based Knowledge Editing in Large Language Models
Tianyi Zhao, Yinhan He, Wendy Zheng, Chen Chen · Apr 7, 2026 · Citations: 0
Long Horizon
Extensive experiments on the MQuAKE-3K benchmark demonstrate the effectiveness of the proposed method for multi-hop reasoning in knowledge editing.
- Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts
Fatih Uenal · Apr 7, 2026 · Citations: 0
Self-graded D7 scores (73-94%) exceed externally judged D8 security scores (20-61%) by a wide margin, though these dimensions use non-comparable scoring regimes.
- Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models
Xiangming Gu, Soham De, Larisa Markeeva, Petar Veličković, Razvan Pascanu · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching
Yicheng Pan, Zhiyuan Ning, Ludi Wang, Yi Du · Apr 7, 2026 · Citations: 0
Rubric Rating
To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching.
- LoRM: Learning the Language of Rotating Machinery for Self-Supervised Condition Monitoring
Xiao Qin, Xingyi Song, Tong Liu, Hatim Laalej, Zepeng Liu · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Evaluating Learner Representations for Differentiation Prior to Instructional Outcomes
Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel · Apr 7, 2026 · Citations: 0
Pairwise Preference
We introduce distinctiveness, a representation-level measure that evaluates how each learner differs from others in the cohort using pairwise distances, without requiring clustering, labels, or task-specific evaluation.
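One natural reading of this distinctiveness measure is each learner's mean pairwise distance to the rest of the cohort; the sketch below uses Euclidean distance as an illustrative assumption, since the abstract does not specify the distance function:

```python
import math

def distinctiveness(learners: list[list[float]]) -> list[float]:
    """Mean Euclidean distance from each learner vector to all others
    in the cohort (one hypothetical instantiation of the measure).
    Requires at least two learners; no clustering or labels needed."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    n = len(learners)
    return [
        sum(dist(learners[i], learners[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

# Two identical learners and one outlier: the outlier scores highest.
cohort = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
scores = distinctiveness(cohort)
```

An outlying learner receives a larger score than learners who sit near duplicates, which matches the intuition of measuring how each learner differs from the cohort.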
- AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning
Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan · Apr 7, 2026 · Citations: 0
Tool Use
To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference.
- "OK Aura, Be Fair With Me": Demographics-Agnostic Training for Bias Mitigation in Wake-up Word Detection
Fernando López, Paula Delgado-Santos, Pablo Gómez, David Solans, Jordi Luque · Apr 7, 2026 · Citations: 0
We utilize the OK Aura database for our experiments, employing a training methodology that excludes demographic labels, which are reserved for evaluation purposes.
- CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training
Seungyoon Lee, Minhyuk Kim, Seongtae Hong, Youngjoon Jang, Dongsuk Oh · Apr 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan · Apr 7, 2026 · Citations: 0
Rather than serving merely as answer generators, we assign VLMs two specialized agent roles: a Refiner and an Inspector.
- Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation
Abdullah Mazhar, Het Riteshkumar Shah, Aseem Srivastava, Smriti Joshi, Md Shad Akhtar · Apr 7, 2026 · Citations: 0
The increasing use of large language models in mental health applications calls for principled evaluation frameworks that assess alignment with psychotherapeutic best practices beyond surface-level fluency.