- Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang · Apr 9, 2026 · Citations: 0
Automatic Metrics General
The advent of agentic multimodal models has empowered systems to actively interact with external environments.
- Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Jiayuan Ye, Vitaly Feldman, Kunal Talwar · Apr 9, 2026 · Citations: 0
Automatic Metrics Law
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- What do Language Models Learn and When? The Implicit Curriculum Hypothesis
Emmy Liu, Kaiser Sun, Millicent Li, Isabelle Lee, Lindia Tjuatja · Apr 9, 2026 · Citations: 0
Automatic Metrics Math Law
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng · Jun 1, 2025 · Citations: 0
Automatic Metrics General
We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results.
- The Detection-Extraction Gap: Models Know the Answer Before They Can Say It
Hanyang Wang, Mingxuan Zhu · Apr 8, 2026 · Citations: 0
Automatic Metrics Coding
Across five model configurations, two families, and three benchmarks, we find that 52–88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix.
- Human-computer interactions predict mental health
Veith Weilnhammer, Jefferson Ortega, David Whitney · Nov 25, 2025 · Citations: 0
Automatic Metrics Medicine
Here, we show that everyday human-computer interactions encode mental health with biomarker accuracy.
- AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages
Lilian Wanzare, Cynthia Amol, Ezekiel Maina, Nelson Odhiambo, Hope Kerubo · Apr 9, 2026 · Citations: 0
Automatic Metrics Multilingual
Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy.
- KV Cache Offloading for Context-Intensive Tasks
Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov · Apr 9, 2026 · Citations: 0
Automatic Metrics General
Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context.
- Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents
Khushal Sethi · Apr 9, 2026 · Citations: 0
Automatic Metrics Math
We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement.
- Stacked from One: Multi-Scale Self-Injection for Context Window Extension
Wei Han, Pan Zhou, Soujanya Poria, Shuicheng Yan · Mar 5, 2026 · Citations: 0
Automatic Metrics General
Across a comprehensive suite of long-context modeling and understanding benchmarks, the proposed model achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy.
- Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms
Joshua Ashkinaze, Ruijia Guan, Laura Kurek, Eytan Adar, Ceren Budak · Jul 4, 2024 · Citations: 0
Human Eval Automatic Metrics General
We evaluate LLMs' capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia's Neutral Point of View (NPOV) policy.
- HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang · Apr 9, 2026 · Citations: 0
LLM As Judge Automatic Metrics General
Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
- HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang · Jan 15, 2026 · Citations: 0
Automatic Metrics Simulation Env Coding
We present HumanLLM, a framework treating psychological patterns as interacting causal forces.
- Training Data Size Sensitivity in Unsupervised Rhyme Recognition
Petr Plecháč, Artjoms Šeļa, Silvie Cinková, Mirella De Sisto, Lara Nugues · Apr 9, 2026 · Citations: 0
Automatic Metrics Multilingual
This complicates automated rhyme recognition and evaluation, especially in multilingual contexts.
- Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection
Khalid Zaman, Melike Sah, Anuwat Chaiwongyenc, Cem Direkoglu · Apr 9, 2026 · Citations: 0
Automatic Metrics General
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving
Xinkai Zhang, Jingtao Zhan, Yiqun Liu, Qingyao Ai · Apr 8, 2026 · Citations: 0
Automatic Metrics General
Trial-and-error is a fundamental strategy for humans to solve complex problems and a necessary capability for Artificial Intelligence (AI) systems operating in real-world environments.
- Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation
Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Yuxi Zhang, Huimin Wang · Apr 9, 2026 · Citations: 0
Automatic Metrics General
Experiments on five QA benchmarks demonstrate that GuarantRAG improves accuracy by up to 12.1% and reduces hallucinations by 16.3% compared to standard and dynamic RAG baselines.
- MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao · Apr 6, 2026 · Citations: 0
Automatic Metrics General
At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while mitigating distribution shift;…
- Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech
Fabian Retkowski, Alexander Waibel · Dec 30, 2025 · Citations: 0
Automatic Metrics General
First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task.
- Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
George Fountzoulas · Apr 9, 2026 · Citations: 0
Automatic Metrics General
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking
Weiyang Huang, Xuefeng Bai, Kehai Chen, Xinyang Chen, Yibin Chen · Apr 9, 2026 · Citations: 0
Automatic Metrics General
Experiments across 9 LRMs and 7 benchmarks show that SAT achieves up to 40% reduction in reasoning tokens while generally maintaining or improving accuracy.
- Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa · Oct 30, 2025 · Citations: 0
Automatic Metrics Coding
We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans.
- Hallucination Detection and Evaluation of Large Language Model
Chenggong Zhang, Haopeng Wang, Hexi Meng · Dec 27, 2025 · Citations: 0
Automatic Metrics General
To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high…
- Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0
Human Eval Automatic Metrics General
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
- Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou · Apr 8, 2026 · Citations: 0
LLM As Judge Automatic Metrics General
We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations.
- Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction
Jackson Petty, Jaulie Goe, Tal Linzen · Apr 8, 2026 · Citations: 0
Automatic Metrics Multilingual
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman, Sajal Chandra Banik · Apr 8, 2026 · Citations: 0
Automatic Metrics Medicine
This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus.
- ClickGuard: A Trustworthy Adaptive Fusion Framework for Clickbait Detection
Chhavi Dhiman, Naman Chawla, Riya Dhami, Gaurav Kumar, Ganesh Naik · Apr 8, 2026 · Citations: 0
Automatic Metrics Coding
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent
Bingxuan Li, Simo Du, Yue Guo · Apr 8, 2026 · Citations: 0
Automatic Metrics Medicine
We propose SEA, a self-learning diagnostic agent with cognitively inspired dual-memory module.
- UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song · Jul 29, 2025 · Citations: 0
Automatic Metrics Coding
The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities.
- TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0
Automatic Metrics General
As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
- LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics
Kosmas Pinitas, Ilias Maglogiannis · Apr 8, 2026 · Citations: 0
Automatic Metrics General
Predicting affect in unconstrained environments remains a fundamental challenge in human-centered AI.
- Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering
Elyas Irankhah, Samah Fodeh · Apr 8, 2026 · Citations: 0
Automatic Metrics Medicine
Third, results on the development set show that alignment accuracy is mainly limited by reasoning.
- Graph Representation-based Model Poisoning on the Heterogeneous Internet of Agents
Hanlin Cai, Houtianfu Wang, Haofan Dong, Kai Li, Sai Zou · Nov 10, 2025 · Citations: 0
Automatic Metrics General
Internet of Agents (IoA) envisions a unified, agent-centric paradigm where heterogeneous large language model (LLM) agents can interconnect and collaborate at scale.
- IndoBERT-Sentiment: Context-Conditioned Sentiment Classification for Indonesian Text
Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja · Apr 8, 2026 · Citations: 0
Automatic Metrics General
In a head-to-head evaluation against three widely used general-purpose Indonesian sentiment models on the same test set, IndoBERT-Sentiment outperforms the best baseline by 35.6 F1 points.
- Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
Md Motaleb Hossen Manik, Ge Wang · Apr 8, 2026 · Citations: 0
Automatic Metrics Math
We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and…
- MARS: Enabling Autoregressive Models Multi-Token Generation
Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun · Apr 8, 2026 · Citations: 0
Automatic Metrics General
When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks.
- iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations
Wenshuo Wang, Boyu Cao, Nan Zhuang, Wei Li · Apr 8, 2026 · Citations: 0
Automatic Metrics General
This suggests that iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms.
- Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
Bajian Xiang, Tingwei Guo, Xuan Chen, Yang Han · Apr 8, 2026 · Citations: 0
Automatic Metrics General
Extensive evaluations across three tasks demonstrate that our approach reduces prefilling FLOPs by 27.48% while maintaining competitive accuracy.
- MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
Xiaotian Luo, Xun Jiang, Jiangcheng Wu · Apr 8, 2026 · Citations: 0
Automatic Metrics Medicine
Interactive medical dialogue benchmarks have shown that LLM diagnostic accuracy degrades significantly when interacting with non-cooperative patients, yet existing approaches either apply adversarial behaviors without graded severity or…
- Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning
Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong · Apr 8, 2026 · Citations: 0
Automatic Metrics Math
Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer.
- Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions
Parth Patil, Dhruv Kumar, Yash Sinha, Murari Mandal · Apr 8, 2026 · Citations: 0
Automatic Metrics General
Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause.
- SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent Communication
Nguyen Le Hoang, Tadahiro Taniguchi, Fang Tianwei, Akira Taniguchi · Oct 29, 2024 · Citations: 0
Automatic Metrics General
Emergent Communication (EmCom) investigates how agents develop symbolic communication through interaction without predefined language.
- How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Minzhu Tu, Shiyu Ni, Keping Bi · Apr 8, 2026 · Citations: 0
Human Eval Automatic Metrics Math
Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases.
- Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
Heng Zhou, Zelin Tan, Zhemeng Zhang, Yutao Fan, Yibing Lin · Apr 8, 2026 · Citations: 0
Automatic Metrics General
When an LLM-based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it?
- PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim · Mar 11, 2026 · Citations: 0
Automatic Metrics Medicine
We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses.
- SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation
Yixi Zhou, Fan Zhang, Zhiqiao Guo, Yu Chen, Haipeng Zhang · Apr 8, 2026 · Citations: 0
Automatic Metrics Coding
Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether LLM-generated SQL programs are structurally reliable.
- Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs
Haoyue Liu, Zhichao Wang, Yongxin Guo, Haoran Shou, Xiaoying Tang · Apr 8, 2026 · Citations: 0
Automatic Metrics General
Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to +2.16 percentage points on average, and reduces optimization token cost by 45–87% on…
- A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM
Bo Wang, Jing Ma, Hongzhan Lin, Zhiwei Yang, Ruichao Yang · Apr 8, 2026 · Citations: 0
Automatic Metrics General
Explainable fake news detection aims to assess the veracity of news claims while providing human-friendly explanations.
- Feedback Adaptation for Retrieval-Augmented Generation
Jihwan Bang, Seunghan Yang, Kyuhong Shim, Simyung Chang, Juntae Lee · Apr 8, 2026 · Citations: 0
Automatic Metrics General
Existing evaluation protocols focus on overall accuracy and fail to capture how systems adapt after feedback is introduced.
- SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning
Zhengyang Ai, Zikang Shan, Xiaodong Ai, Jingxian Tang, Hangkai Hu · Apr 8, 2026 · Citations: 0
Automatic Metrics Math
Extensive experiments in math reasoning across three base models and five benchmarks demonstrate that SHAPE achieves an average accuracy gain of 3% with 30% reduced token consumption.
- DiffuMask: Diffusion Language Model for Token-level Prompt Pruning
Caleb Zheng, Jyotika Singh, Fang Tu, Weiyi Sun, Sujeeth Bharadwaj · Apr 8, 2026 · Citations: 0
Automatic Metrics General
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs
Maotian Ma, Zheni Zeng, Zhenghao Liu, Yukun Yan · Apr 8, 2026 · Citations: 0
Automatic Metrics Medicine Coding
Though scientific theories and rules can efficiently direct the behaviors of human practitioners, LLMs still do not utilize this highly condensed knowledge sufficiently through training or prompting.
- Does a Global Perspective Help Prune Sparse MoEs Elegantly?
Zeliang Zhang, Nikhil Ghosh, Jiani Liu, Bin Yu, Xiaodong Liu · Apr 8, 2026 · Citations: 0
Automatic Metrics Law
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- PACIFIC: Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs
Tianyu Zhao, Siqi Li, Yasser Shoukry, Salma Elmalaki · Feb 6, 2026 · Citations: 0
Automatic Metrics General
Based on these findings, we introduce PACIFIC (Preference Alignment Choices Inference for Five-factor Identity Characterization), a personality-labeled preference dataset containing 1200 preference statements spanning diverse domains (e.g.,…
- ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs
Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka · Apr 7, 2026 · Citations: 0
Automatic Metrics General
Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized.
- Multi-objective Evolutionary Merging Enables Efficient Reasoning Models
Mario Iacobelli, Adrian Robert Minut, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli · Apr 7, 2026 · Citations: 0
Automatic Metrics Math
Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the…
- Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection
Afroza Nowshin, Prithweeraj Acharjee Porag, Haziq Jeelani, Fayeq Jeelani Syed · Apr 7, 2026 · Citations: 0
Automatic Metrics Multilingual
Through a combination of automatic evaluation and qualitative analysis, we observe an apparent accuracy-fidelity trade-off: high-resource baselines such as NLLB (No Language Left Behind) achieve higher aggregate BLEU scores (13.75) by…
- Team Fusion@ SU@ BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking
Georgi Grazhdanski, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva · Apr 7, 2026 · Citations: 0
Automatic Metrics Multilingual
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning
Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao · Apr 7, 2026 · Citations: 0
Automatic Metrics General
Remarkably, these models achieve accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark high enough to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and significantly surpass…