Tag: General

General-purpose raters without strict specialist requirements.

Papers in tag: 585

Research Utility Snapshot

Evaluation Modes

Automatic Metrics (18)
Simulation Env (2)
Human Eval (1)

Human Feedback Types

Demonstrations (4)
Pairwise Preference (4)
Critique Edit (1)

Required Expertise

General (20)

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Chungpa Lee, Jy-yong Sohn, Kangwook Lee · Feb 26, 2026 · Citations: 0

Demonstrations Automatic Metrics General

OmniGAIA: Towards Native Omni-Modal AI Agents

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin · Feb 26, 2026 · Citations: 0

Automatic Metrics General

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world.
To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities.

Moral Preferences of LLMs Under Directed Contextual Influence

Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov · Feb 26, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General

Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences.
We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they fav

Towards Better RL Training Data Utilization via Second-Order Rollout

Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui · Feb 26, 2026 · Citations: 0

Critique Edit Automatic Metrics General

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price · Feb 26, 2026 · Citations: 0

Demonstrations Automatic Metrics General

We introduce AuditBench, an alignment auditing benchmark.
To demonstrate AuditBench's utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools.

Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA

Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao, Tianyi Li · Feb 26, 2026 · Citations: 0

Automatic Metrics General

Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79% table compression and a 2

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu, Shu Xu · Feb 26, 2026 · Citations: 0

Automatic Metrics General

Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.
In this work, we propose Search More, Think Less (SMTL), a framework for long-horizon agentic search that targets both efficiency and generalization.

Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li, Linfang Shang · Feb 26, 2026 · Citations: 0

Automatic Metrics General

Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-a

Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference

Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0

Automatic Metrics General

Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.

DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu, James Kwok · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General

This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.

Distill and Align Decomposition for Enhanced Claim Verification

Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero, Arturo Oncevay · Feb 25, 2026 · Citations: 0

Human Eval Automatic Metrics General

Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84).
Human evaluation confirms the high quality of the generated subclaims.

FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0

Demonstrations Automatic Metrics General

In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.

D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models

Shunsuke Ubukata · Feb 25, 2026 · Citations: 0

Automatic Metrics General

The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems

Hyo Jin Kim · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General

Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.
The framework therefore provides a unified structural account of both human silence under pressure and AI preference-driven distortion.

Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling

Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen, Tianyi Zhang · Feb 25, 2026 · Citations: 0

Demonstrations Automatic Metrics General

Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but even closed-source LLMs.

Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning

Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics General

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy.
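The control flow described in this excerpt (decompose, generate a problem per agent, solve, revise the prompt on failure) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Agent` class, `run_planner`, and every callable passed to it are hypothetical names, the "solver" is a toy stand-in for a classical PDDL planner, and the prompt revision is a plain string edit rather than a real textual-gradient update.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Agent:
    """Hypothetical lower-layer agent; `prompt` is the text that feedback rewrites."""
    name: str
    prompt: str


def run_planner(
    task: str,
    agents: List[Agent],
    decompose: Callable[[str, int], List[str]],
    make_problem: Callable[[Agent, str], str],
    solve: Callable[[str], Optional[List[str]]],
    revise_prompt: Callable[[Agent, str], str],
    max_rounds: int = 3,
) -> Dict[str, List[str]]:
    """Assumed loop: split the task, let each agent emit a problem, retry on failure."""
    subtasks = decompose(task, len(agents))  # upper layer: one subtask per agent
    plans: Dict[str, List[str]] = {}
    for agent, subtask in zip(agents, subtasks):
        for _ in range(max_rounds):
            problem = make_problem(agent, subtask)  # lower layer: problem text
            plan = solve(problem)                   # stand-in for a classical planner
            if plan is not None:
                plans[agent.name] = plan
                break
            # Planning failed: rewrite this agent's prompt and try again.
            agent.prompt = revise_prompt(agent, problem)
    return plans


# Toy stand-ins (all hypothetical) so the sketch runs end to end.
agents = [
    Agent("arm", "Write a PDDL problem."),
    Agent("cart", "Write a PDDL problem. Include a goal."),
]
plans = run_planner(
    task="stack boxes",
    agents=agents,
    decompose=lambda task, n: [f"{task} / part {i}" for i in range(n)],
    make_problem=lambda agent, subtask: f"{agent.prompt} :: {subtask}",
    solve=lambda problem: ["move", "done"] if "goal" in problem else None,
    revise_prompt=lambda agent, failed: agent.prompt + " Include a goal.",
)
```

In this toy run the first agent fails once, has its prompt revised, and succeeds on the second round, while the second agent succeeds immediately. A real system would replace the lambdas with LLM calls and an actual planner.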

CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou, Anxiang Zeng · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General

Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.
Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models.

RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li, Zhen Qin · Feb 25, 2026 · Citations: 0

Rubric Rating Automatic Metrics General

Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.

ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han · Feb 25, 2026 · Citations: 0

Simulation Env General

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.
Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL.

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding, Gedas Bertasius · Feb 25, 2026 · Citations: 0

Simulation Env General

We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%.