- SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
- Document Reconstruction Unlocks Scalable Long-Context RLVR
Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin · Feb 9, 2026 · Citations: 0
Rubric Rating Automatic Metrics
However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.
- Can Large Language Models Replace Human Coders? Introducing ContentBench
Michael Haman · Feb 23, 2026 · Citations: 0
Critique Edit Automatic Metrics
This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.
- PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Multi Agent
We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.
- MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
- SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics
Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
- ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.
- Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality ''transfer'' across languages.
- Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and
- Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai · Feb 12, 2026 · Citations: 0
Automatic Metrics Tool Use
To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
- AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen · Feb 8, 2026 · Citations: 0
Automatic Metrics Long Horizon
Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons.
- KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026 · Citations: 0
Automatic MetricsSimulation Env
Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation.
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li · Feb 23, 2026 · Citations: 0
Automatic Metrics Long Horizon
The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
- REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents
Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang · Feb 15, 2026 · Citations: 0
Automatic Metrics Tool Use
To address these challenges, we propose REDSearcher, a unified framework that codesigns complex task synthesis, midtraining, and posttraining for scalable searchagent optimization.
- AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0
Automatic Metrics Multi Agent
We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining.
- A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0
Automatic Metrics Tool Use
To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.
- AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026 · Citations: 0
Automatic Metrics Long Horizon
With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.
- Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Peter Hase, Christopher Potts · Feb 24, 2026 · Citations: 0
Automatic MetricsSimulation Env
In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual…
- GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
- Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song · Feb 23, 2026 · Citations: 0
Automatic Metrics Long Horizon
We introduce (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.
- Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026 · Citations: 0
Automatic Metrics Long Horizon
Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
- A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.