- ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control
Christopher Cruz · Mar 29, 2026 · Citations: 0
- Article and Comment Frames Shape the Quality of Online Comments
Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann · Mar 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- HumMusQA: A Human-written Music Understanding QA Benchmark Dataset
Benno Weck, Pablo Puentes, Andrea Poltronieri, Satyajeet Prabhu, Dmitry Bogdanov · Mar 29, 2026 · Citations: 0
The evaluation of music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that truly tests whether models can perceive and interpret music, a standard that current data methodologies frequently…
- KazByte: Adapting Qwen models to Kazakh via Byte-level Adapter
Rauan Akylzhanov · Mar 29, 2026 · Citations: 0
Our central hypothesis is that this two-stage process -- first teach the interface, then adapt the model -- should match or exceed the accuracy of the original Qwen2.5-7B on standard Kazakh benchmarks.
- What can LLMs tell us about the mechanisms behind polarity illusions in humans? Experiments across model scales and training steps
Dario Paape · Mar 29, 2026 · Citations: 0
The results have implications for human sentence processing: it may not be necessary to assume "rational inference" mechanisms that convert ill-formed sentences into well-formed ones to explain polarity illusions, given that LLMs cannot…
- EffiSkill: Agent Skill Based Automated Code Efficiency Optimization
Zimu Wang, Yuling Shi, Mengfan Li, Zijun Liu, Jie M. Zhang · Mar 29, 2026 · Citations: 0
In this paper, we present EffiSkill, a framework for code-efficiency optimization that builds a portable optimization toolbox for LLM-based agents.
- Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3
Natapong Nitarach · Mar 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ProText: A benchmark dataset for measuring (mis)gendering in long-form texts
Hadas Kotek, Margit Bowler, Patrick Sonnenberg, Yu'an Yang · Mar 29, 2026 · Citations: 0
The dataset is designed to probe (mis)gendering in text transformations such as summarization and rewrites using state-of-the-art Large Language Models, extending beyond traditional pronoun resolution benchmarks and beyond the gender…
- Q-Bridge: Code Translation for Quantum Machine Learning via LLMs
Runjia Zeng, Priyabrata Senapati, Ruixiang Tang, Dongfang Liu, Qiang Guan · Mar 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner · Mar 29, 2026 · Citations: 0
Expert Verification · Multi Agent
In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
- KVSculpt: KV Cache Compression as Distillation
Bo Jiang, Sian Jin · Mar 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Conversational Agents and the Understanding of Human Language: Reflections on AI, LLMs, and Cognitive Science
Andrei Popescu-Belis · Mar 29, 2026 · Citations: 0
In this paper, we discuss the relationship between natural language processing by computers (NLP) and the understanding of the human language capacity, as studied by linguistics and cognitive science.
- Understanding Teacher Revisions of Large Language Model-Generated Feedback
Conrad Borchers, Luiz Rodrigues, Newarney Torrezão da Costa, Cleon Xavier, Rafael Ferreira Mello · Mar 29, 2026 · Citations: 0
Critique Edit · RLAIF or Synthetic Feedback
First, we find that teachers accept AI feedback without modification in about 80% of cases, while edited feedback tends to be significantly longer and subsequently shortened by teachers.
- Emergent Social Intelligence Risks in Generative Multi-Agent Systems
Yue Huang, Yu Jiang, Wenjie Wang, Haomin Zhuang, Xiaonan Luo · Mar 29, 2026 · Citations: 0
Multi Agent
Multi-agent systems composed of large generative models are rapidly moving from laboratory prototypes to real-world deployments, where they jointly plan, negotiate, and allocate shared resources to solve complex tasks.
- TailNLG: A Multilingual Benchmark Addressing Verbalization of Long-Tail Entities
Lia Draetta, Michael Oliverio, Virginia Ramón-Ferrer, Pier Felice Balestrucci, Flaviana Corallo · Mar 29, 2026 · Citations: 0
We introduce TailNLG, a new multilingual benchmark in English, Italian, and Spanish, built from Wikidata and covering entities with varying levels of popularity.
- Let the Agent Steer: Closed-Loop Ranking Optimization via Influence Exchange
Yin Cheng, Liao Zhou, Xiyu Liang, Dihao Luo, Tewei Lee · Mar 29, 2026 · Citations: 0
- Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG
Boxi Yu, Yuzhong Zhang, Liting Lin, Lionel Briand, Emir Muñoz · Mar 29, 2026 · Citations: 0
We evaluate RT4CHART on RAGTruth++ (408 samples) and RAGTruth-Enhance (2,675 samples), a newly re-annotated benchmark.
- KAT-Coder-V2 Technical Report
Fengxiang Li, Han Zhang, Haoyang Huang, Jinghui Wang, Jinhua Hao · Mar 29, 2026 · Citations: 0
We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou.
- Can Large Language Models Simulate Human Cognition Beyond Behavioral Imitation?
Yuxuan Gu, Lunjun Liu, Xiaocheng Feng, Kun Zhu, Weihong Zhong · Mar 29, 2026 · Citations: 0
An essential problem in artificial intelligence is whether LLMs can simulate human cognition or merely imitate surface-level behaviors, while existing datasets suffer from either synthetic reasoning traces or population-level aggregation,…
- Investigating the Influence of Language on Sycophantic Behavior of Multilingual LLMs
Bayan Abdullah Aldahlawi, A. B. M. Ashikur Rahman, Irfan Ahmad · Mar 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Degree of Language Diacriticity and Its Effect on Tasks
Adi Cohen, Yuval Pinter · Mar 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Budget-Xfer: Budget-Constrained Source Language Selection for Cross-Lingual Transfer to African Languages
Tewodros Kederalah Idris, Roald Eiselen, Prasenjit Mitra · Mar 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu · Mar 29, 2026 · Citations: 0
Rubric Rating · Expert Verification
We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
- Umwelt Engineering: Designing the Cognitive Worlds of Linguistic Agents
Rodney Jehu-Appiah · Mar 29, 2026 · Citations: 0
No constrained agent outperforms the control individually, yet a 3-agent ensemble achieves 100% ground-truth coverage versus 88.2% for the control.
- LongCat-Next: Lexicalizing Modalities as Discrete Tokens
Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang · Mar 29, 2026 · Citations: 0
As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks.
- A gentle tutorial and a structured reformulation of Bock's algorithm for minimum directed spanning trees
Yuxi Wang, Jungyeul Park · Mar 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Cross-attentive Cohesive Subgraph Embedding to Mitigate Oversquashing in GNNs
Tanvir Hossain, Muhammad Ifte Khairul Islam, Lilia Chebbah, Charles Fanning, Esra Akbas · Mar 29, 2026 · Citations: 0
- Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models
Duanyi Yao, Changyue Li, Zhicong Huang, Cheng Hong, Songze Li · Mar 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
Utsav Maskey, Mark Dras, Usman Naseem · Mar 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents
Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang · Mar 29, 2026 · Citations: 0
Long Horizon
As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck.
- A tree interpretation of arc standard dependency derivation
Zihao Huang, Ai Ka Lee, Jungyeul Park · Mar 29, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Multi-Agent Dialectical Refinement for Enhanced Argument Classification
Jakub Bąba, Jarosław A. Chudziak · Mar 29, 2026 · Citations: 0
Multi Agent
We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty.
- Improving Attributed Long-form Question Answering with Intent Awareness
Xinran Zhao, Aakanksha Naik, Jay DeYoung, Joseph Chee Chang, Jena D. Hwang · Mar 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams
Isaac Llorente-Saguer · Mar 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring
Jakub Masłowski, Jarosław A. Chudziak · Mar 28, 2026 · Citations: 0
Multi Agent
Large Language Models (LLMs) are being increasingly used as autonomous agents in complex reasoning tasks, opening the niche for dialectical interactions.
- Not Worth Mentioning? A Pilot Study on Salient Proposition Annotation
Amir Zeldes, Katherine Conhaim, Lauren Levine · Mar 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach
Maziar Kianimoghadam Jouneghani · Mar 28, 2026 · Citations: 0
To address this gap, this ongoing study proposes a Hybrid Intelligence Loop, a human-in-the-loop (HITL) framework that grounds model assessment in human-written rationales from native-speaking annotators.
- LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications
Alexandre Cristovão Maiorano · Mar 28, 2026 · Citations: 0
We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow.
- Inference-Time Structural Reasoning for Compositional Vision-Language Understanding
Amartya Bhattacharya · Mar 28, 2026 · Citations: 0
We present a unified evaluation and augmentation framework benchmarking four architecturally diverse VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) on the Winoground benchmark under plain and scene-graph-augmented regimes.
- ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering
Joonhyung Bae · Mar 28, 2026 · Citations: 0
- PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0
Expert Verification
In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
- SACRED: A Faithful Annotated Multimedia Multimodal Multilingual Dataset for Classifying Connectedness Types in Online Spirituality
Qinghao Guan, Yuchen Pan, Donghao Li, Zishi Zhang, Yiyang Chen · Mar 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Self-evolving AI agents for protein discovery and directed evolution
Yang Tan, Lingrong Zhang, Mingchen Li, Yuanxi Yu, Bozitao Zhong · Mar 28, 2026 · Citations: 0
Multi Agent
Protein scientific discovery is bottlenecked by the manual orchestration of information and algorithms, while general agents are insufficient in complex domain projects.
- Mitigating Hallucination on Hallucination in RAG via Ensemble Voting
Zequn Xie, Zhengyang Sun · Mar 28, 2026 · Citations: 0
Multi Agent
VOTE-RAG includes: (1) Retrieval Voting, where multiple agents generate diverse queries in parallel and aggregate all retrieved documents; (2) Response Voting, where multiple agents independently generate answers based on the aggregated…
- SCOPE: Tree-based Self-Correcting Online Log Parsing via Syntactic-Semantic Collaboration
Dongyi Fan, Suqiong Zhang, Lili He, Ming Liu, Yifan Huo · Mar 28, 2026 · Citations: 0
Extensive evaluations on diverse benchmark datasets show that SCOPE outperforms state-of-the-art methods in both accuracy and efficiency.
- Structural Stress and Learned Helplessness in Afghanistan: A Multi-Layer Analysis of the AFSTRESS Dari Corpus
Jawid Ahmad Baktash, Mursal Dawodi, Nadira Ahmadi · Mar 28, 2026 · Citations: 0
We introduce AFSTRESS, the first multi-label corpus of self-reported stress narratives in Dari (Eastern Persian), comprising 737 responses collected from Afghan individuals during an ongoing humanitarian crisis.
- Rethinking Easy-to-Hard: Limits of Curriculum Learning in Post-Training for Deductive Reasoning
Maximilian Mordig, Andreas Opedal, Weiyang Liu, Bernhard Schölkopf · Mar 28, 2026 · Citations: 0
We present a systematic empirical study of CL for post-training of LLMs, using synthetic arithmetic and logical benchmarks where difficulty is characterized by reasoning complexity rather than surface-level proxies.
- LightMover: Generative Light Movement with Color and Intensity Controls
Gengze Zhou, Tianyu Wang, Soo Ye Kim, Zhixin Shu, Xin Yu · Mar 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- daVinci-LLM: Towards the Science of Pretraining
Yiwei Qin, Yixiu Liu, Tiantian Mi, Muhang Xie, Zhen Huang · Mar 28, 2026 · Citations: 0
Through 200+ controlled ablations, we establish that: processing depth systematically enhances capabilities, establishing it as a critical dimension alongside volume scaling; different domains exhibit distinct saturation dynamics,…
- Weakly Convex Ridge Regularization for 3D Non-Cartesian MRI Reconstruction
German Shâma Wache, Chaithya G R, Asma Tanabene, Sebastian Neumayer · Mar 28, 2026 · Citations: 0
- Learning to Predict Future-Aligned Research Proposals with Language Models
Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu · Mar 28, 2026 · Citations: 0
Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality.
- Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models
Junhyeok Lee, Kyu Sung Choi · Mar 28, 2026 · Citations: 0
Pairwise Preference
FARE reveals that routing-level preference shifts are either unachievable (Mixtral, Qwen1.5, Qwen3), statistically non-robust (DeepSeekMoE), or accompanied by substantial utility cost (OLMoE, -4.4%p CrowS-Pairs at -6.3%p TQA).
- Story2Proposal: A Scaffold for Structured Scientific Paper Writing
Zhuoyang Qian, Wei Shi, Xu Lin, Li Ling, Meng Luo · Mar 28, 2026 · Citations: 0
Multi Agent
We introduce Story2Proposal, a contract-governed multi-agent framework that converts a research story into a structured manuscript through coordinated agents operating under a persistent shared visual contract.
- ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel · Mar 28, 2026 · Citations: 0
To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding.
- Debiasing Large Language Models toward Social Factors in Online Behavior Analytics through Prompt Knowledge Tuning
Hossein Salemi, Jitin Krishnan, Hemant Purohit · Mar 28, 2026 · Citations: 0
Large Language Models (LLMs), trained on human-generated corpora, may implicitly mimic this social attribution process in social contexts.
- Text Data Integration
Md Ataur Rahman, Dimitris Sacharidis, Oscar Romero, Sergi Nadal · Mar 28, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching
Andrea Fraschini, Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli · Mar 27, 2026 · Citations: 0
Long Horizon
Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.
- Introducing MELI: the Mandarin-English Language Interview Corpus
Suyuan Liu, Molly Babel · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- TAPS: Task Aware Proposal Distributions for Speculative Sampling
Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem · Mar 27, 2026 · Citations: 0
Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench.
- Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language
Hanif Rahman, Shafeeq ur Rehman · Mar 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.