- LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026 · Citations: 0
Rubric Rating Human Eval
We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
- TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0
Red Team Automatic Metrics Long Horizon
As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
- Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka · Apr 2, 2026 · Citations: 0
Pairwise Preference Llm As JudgeAutomatic Metrics
A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
- Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Llm As Judge
We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
- A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan · Apr 7, 2026 · Citations: 0
Expert Verification Automatic Metrics
Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
- HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang · Apr 9, 2026 · Citations: 0
Pairwise Preference Llm As JudgeAutomatic Metrics
Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
- RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile · Apr 2, 2026 · Citations: 0
Expert Verification Llm As JudgeAutomatic Metrics
This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic…
- Signals: Trajectory Sampling and Triage for Agentic Interactions
Shuguang Chen, Adil Hafeez, Salman Paracha · Apr 1, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
We propose a lightweight, signal-based framework for triaging agentic interaction trajectories.
- Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki · Apr 1, 2026 · Citations: 0
Rubric Rating Automatic Metrics
We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal…
- Do Phone-Use Agents Respect Your Privacy?
Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang · Apr 1, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We study whether phone-use agents respect privacy while completing benign mobile tasks.
- LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo
Ojas Jain, Dhruv Kumar · Apr 7, 2026 · Citations: 0
Simulation Env Multi Agent
We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning…
- Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation
Arthur Jakobsson, Abhinav Mahajan, Karthik Pullalarevu, Krishna Suresh, Yunchao Yao · Apr 23, 2026 · Citations: 0
Automatic MetricsSimulation Env Long Horizon
To mitigate this, we present a novel approach that leverages learned simulation priors to inform goal-conditioned dynamic manipulation of ropes for efficient and accurate task execution.
- How Much LLM Does a Self-Revising Agent Actually Need?
Sungwoo Jung, Seonil Son · Apr 8, 2026 · Citations: 0
Critique Edit Automatic Metrics
Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop.
- MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang · Apr 7, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.
- Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang · Apr 7, 2026 · Citations: 0
Automatic MetricsSimulation Env Multi Agent
Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision.
- QED-Nano: Teaching a Tiny Model to Prove Hard Theorems
LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching · Apr 6, 2026 · Citations: 0
Rubric Rating Automatic Metrics
To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
- Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu · Apr 2, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process.
- Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy
Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang · Apr 2, 2026 · Citations: 0
Expert Verification Automatic Metrics
Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic…
- Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
Simona-Vasilica Oprea, Adela Bâra · Apr 1, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Using the Anthropic HHRLHF dataset, we evaluate ten diverse large language models LLMs under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task.
- MemRerank: Preference Memory for Personalized Product Reranking
Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong · Mar 31, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch.
- LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
- Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
Shiwan Zhao, Zhihu Wang, Xuyang Zhao, Jiaming Zhou, Caiyue Xu · Apr 9, 2026 · Citations: 0
Pairwise Preference Long Horizon
Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines.
- The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle
Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen · Mar 30, 2026 · Citations: 0
Critique Edit Tool Use
Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin.
- Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA
Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo · Apr 24, 2026 · Citations: 0
Automatic Metrics Multi Agent
We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis.
- Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents
Khushal Sethi · Apr 9, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement.
- Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
Xin Sun, Di Wu, Sijing Qin, Isao Echizen, Abdallah El Ali · Apr 7, 2026 · Citations: 0
Pairwise Preference Llm As Judge
Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge).
- MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong · Apr 6, 2026 · Citations: 0
Automatic Metrics Long Horizon
Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over…
- OSCAR: Orchestrated Self-verification and Cross-path Refinement
Yash Shah, Abhijit Chakraborty, Naresh Kumar Devulapally, Vishnu Lokhande, Vivek Gupta · Apr 2, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods.
- Asymmetric Actor-Critic for Multi-turn LLM Agents
Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia · Mar 31, 2026 · Citations: 0
Automatic Metrics Long Horizon
In many real-world applications, agents must succeed in one-shot settings where retries are impossible.
- MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin · Mar 30, 2026 · Citations: 0
Rubric Rating
Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs.
- QDTraj: Exploration of Diverse Trajectory Primitives for Articulated Objects Robotic Manipulation
Mathilde Kappel, Mahdi Khoramshahi, Louis Annabi, Faiz Ben Amar, Stéphane Doncieux · Apr 24, 2026 · Citations: 0
Simulation Env Long Horizon
To do so, we propose a method based on Quality-Diversity algorithms that leverages sparse reward exploration in order to generate a set of diverse and high-performing trajectory primitives for a given manipulation task.
- Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho · Apr 6, 2026 · Citations: 0
Human EvalAutomatic Metrics
However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation.
- SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks
Sunder Ali Khowaja, Kapal Dev, Engin Zeydan, Madhusanka Liyanage · Apr 2, 2026 · Citations: 0
Automatic MetricsSimulation Env
In this regard, we propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL)…
- JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai · Apr 3, 2026 · Citations: 0
Pairwise Preference
JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale…
- EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation
Aimin Zhang, Jiajing Guo, Fuwei Jia, Chen Lv, Boyu Wang · Apr 22, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics Multi Agent
Experimental results based on real-world foreign trade scenarios demonstrate that, after integrating EvoAgent, GPT5.2 achieves significant improvements in professionalism, accuracy, and practical utility.
- ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov · Apr 2, 2026 · Citations: 0
Automatic MetricsSimulation Env Multi Agent
However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.
- ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models
Delip Rao, Feijiang Han, Chris Callison-Burch · Apr 2, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
By contrast, zero-shot chain-of-thought on the base Gemma3-1B harms accuracy relative to direct answers, and preference optimization with a simple format+accuracy reward underperforms supervised reasoning.
- Multi-Agent Dialectical Refinement for Enhanced Argument Classification
Jakub Bąba, Jarosław A. Chudziak · Mar 29, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics Multi Agent
We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty.
- Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions
Pengcheng Lyu, Chaokun Zhang, Gong Chen, Tao Tang, Zhaoxiang Luo · Apr 2, 2026 · Citations: 0
Automatic Metrics Multi Agent
Multi-agent collaborative perception enables autonomous systems to overcome individual sensing limits through collective intelligence.
- Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification
Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan · Mar 30, 2026 · Citations: 0
Automatic Metrics Multi Agent
We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation.
- Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems
Meghana Karnam, Ananya Joshi · Apr 24, 2026 · Citations: 0
Llm As Judge Long Horizon
Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression.
- Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
- Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation
Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang · Apr 1, 2026 · Citations: 0
Rubric RatingCritique Edit
However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to…
- TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
Xinliang Frederick Zhang, Lu Wang · Apr 9, 2026 · Citations: 0
Pairwise Preference Long Horizon
Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individual's needs and preferences.
- From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
Zhengxu Yu, Yu Fu, Zhiyuan He, Yuxuan Huang, Lee Ka Yiu · Apr 24, 2026 · Citations: 0
Automatic Metrics Multi Agent
Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning.
- Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri · Apr 23, 2026 · Citations: 0
Automatic Metrics Long Horizon
The transition from stateless language model inference to persistent, multi session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production grade agentic systems.
- PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai · Apr 9, 2026 · Citations: 0
Automatic Metrics Long Horizon
Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints.
- SkillX: Automatically Constructing Skill Knowledge Bases for Agents
Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang · Apr 6, 2026 · Citations: 0
Automatic Metrics Long Horizon
Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation, repeatedly rediscover similar behaviors from limited…
- $\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi · Apr 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound.
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang · Apr 24, 2026 · Citations: 0
Simulation Env Long Horizon
Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities.
- Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang · Apr 9, 2026 · Citations: 0
Automatic Metrics Tool Use
The advent of agentic multimodal models has empowered systems to actively interact with external environments.
- SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning
Zhengyang Ai, Zikang Shan, Xiaodong Ai, Jingxian Tang, Hangkai Hu · Apr 8, 2026 · Citations: 0
Automatic Metrics Long Horizon
Extensive experiments in math reasoning across three base models and five benchmarks demonstrate that SHAPE achieves an average accuracy gain of 3% with 30% reduced token consumption.
- Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng · Apr 2, 2026 · Citations: 0
Automatic Metrics Long Horizon
Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO.
- LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
Mayank Mayank, Bharanidhar Duraisamy, Florian Geiss · Apr 2, 2026 · Citations: 0
Automatic Metrics Long Horizon
Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset…
- From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?
Binyan Xu, Dong Fang, Haitao Li, Kehuan Zhang · Apr 2, 2026 · Citations: 0
Automatic Metrics Multi Agent
Multi-agent systems (MAS) tackle complex tasks by distributing expertise, though this often comes at the cost of heavy coordination overhead, context fragmentation, and brittle phase ordering.
- Cognitive Friction: A Decision-Theoretic Framework for Bounded Deliberation in Tool-Using Agents
Davide Di Gioia · Mar 31, 2026 · Citations: 0
Automatic Metrics Tool Use
Autonomous tool-using agents in networked environments must decide which information source to query and when to stop querying and act.
- AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents
Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang · Mar 29, 2026 · Citations: 0
Automatic Metrics Long Horizon
As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck.
- Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri · Apr 23, 2026 · Citations: 0
Rubric Rating
To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation.
- Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake
Guan Gui, Peter Zandi, Jacob Taylor, Ananya Joshi · Apr 23, 2026 · Citations: 0
Expert Verification
We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions.
- Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts
Haolei Xu, Haiwen Hong, Hongxing Li, Rui Zhou, Yang Zhang · Apr 9, 2026 · Citations: 0
Expert Verification
Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks.