
Tag: General

General-purpose raters without strict specialist requirements.

Papers in tag: 528

Research Utility Snapshot

Evaluation Modes

  • Automatic Metrics (17)
  • Simulation Env (3)
  • Human Eval (1)

Human Feedback Types

  • Pairwise Preference (4)
  • Demonstrations (2)
  • Critique Edit (1)

Required Expertise

  • General (20)
Distill and Align Decomposition for Enhanced Claim Verification

Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero, Arturo Oncevay · Feb 25, 2026 · Citations: 0

Human Eval Automatic Metrics General
  • Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84).
  • Human evaluation confirms the high quality of the generated subclaims.
FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0

Demonstrations Automatic Metrics General
  • In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems

Hyo Jin Kim · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General
  • Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.
  • The framework therefore provides a unified structural account of both human silence under pressure and AI preference-driven distortion.
Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling

Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen, Tianyi Zhang · Feb 25, 2026 · Citations: 0

Demonstrations Automatic Metrics General
  • Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also closed-source LLMs.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning

Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics General
  • We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
  • When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy.
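
A minimal sketch of the decompose-then-plan loop this summary describes, under stated assumptions: call_llm and solve_pddl are hypothetical stand-ins for an LLM client and a classical PDDL planner, and the TextGrad-inspired prompt revision is reduced to a single feedback call. This illustrates the pattern only, not the authors' implementation.

```python
# Hypothetical stubs: replace with a real LLM client and a classical planner
# (e.g., Fast Downward). Names here are placeholders, not the paper's API.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call."""
    raise NotImplementedError

def solve_pddl(domain: str, problem: str) -> list[str] | None:
    """Hypothetical planner call; returns a plan or None on failure."""
    raise NotImplementedError

def plan_with_prompt_optimization(task: str, domain: str, max_rounds: int = 3):
    # Upper layer: decompose the task and assign subtasks to lower-layer agents.
    subtasks = call_llm(f"Decompose into subtasks, one per line:\n{task}").splitlines()
    prompts = {s: f"Write a PDDL problem for subtask: {s}" for s in subtasks}

    for _ in range(max_rounds):
        plans, failures = {}, {}
        for subtask, prompt in prompts.items():
            problem = call_llm(prompt)          # lower-layer agent emits a PDDL problem
            plan = solve_pddl(domain, problem)  # classical planner tries to solve it
            if plan is None:
                failures[subtask] = problem
            else:
                plans[subtask] = plan
        if not failures:
            return plans
        # Textual-gradient-style update: ask an LLM to revise the failing
        # agent's prompt based on the failed PDDL problem.
        for subtask, problem in failures.items():
            prompts[subtask] = call_llm(
                "The PDDL problem below produced no valid plan.\n"
                f"{problem}\n"
                "Rewrite the generation prompt so the next attempt is solvable."
            )
    return plans  # best effort after max_rounds
```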
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou, Anxiang Zeng · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General
  • Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.
  • Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models.
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han · Feb 25, 2026 · Citations: 0

Simulation Env General
  • Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.
  • Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL.
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding, Gedas Bertasius · Feb 25, 2026 · Citations: 0

Simulation Env General
  • We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
  • Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%.
Training Generalizable Collaborative Agents via Strategic Risk Aversion

Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar · Feb 25, 2026 · Citations: 0

Automatic Metrics General
  • Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals.
  • Inspired by these insights, we develop a multi-agent reinforcement learning (MARL) algorithm that integrates strategic risk aversion into standard policy optimization methods.
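
The summary does not specify the risk measure, so the sketch below shows one common way to make a policy objective risk-averse: blending the expected return with CVaR (the mean of the worst alpha-fraction of sampled returns). The CVaR choice and the blending weight are assumptions for illustration, not the paper's method.

```python
import numpy as np

def cvar(returns: np.ndarray, alpha: float = 0.2) -> float:
    """Conditional value-at-risk: mean of the worst alpha-fraction of returns."""
    k = max(1, int(np.ceil(alpha * len(returns))))
    return float(np.sort(returns)[:k].mean())

def risk_averse_objective(returns: np.ndarray, alpha: float = 0.2, lam: float = 0.5) -> float:
    """Blend of expected return and CVaR; lam controls the degree of risk aversion."""
    return (1.0 - lam) * float(returns.mean()) + lam * cvar(returns, alpha)

# Example: two joint policies with the same mean return but different tail risk;
# the risk-averse objective prefers the one whose worst outcomes are less severe.
safe  = np.array([0.9, 1.0, 1.0, 1.1, 1.0])
risky = np.array([0.0, 1.5, 1.5, 1.0, 1.0])
print(risk_averse_objective(safe), risk_averse_objective(risky))  # 0.95 vs 0.5
```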
Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu · Feb 25, 2026 · Citations: 0

Critique Edit Automatic Metrics General
  • To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer (a simplified version of this loop is sketched after this entry).
  • Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%.
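
A simplified version of the critique-and-rewrite loop described above. call_llm and detect_sensitive_spans are hypothetical placeholders; the actual SemSIEdit Editor is not reproduced here.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in any LLM client

def detect_sensitive_spans(text: str) -> list[str]:
    raise NotImplementedError  # plug in a sensitive-information detector

def edit_until_clean(text: str, max_iters: int = 5) -> str:
    for _ in range(max_iters):
        spans = detect_sensitive_spans(text)
        if not spans:
            return text  # nothing sensitive remains
        # Critique step: explain why the spans are sensitive and how to rephrase.
        critique = call_llm(
            "Explain why these spans are sensitive and how to rephrase them "
            f"without breaking the narrative:\n{spans}\n\nText:\n{text}"
        )
        # Rewrite step: apply the critique while preserving narrative flow.
        text = call_llm(
            "Rewrite the text applying this critique, preserving flow.\n"
            f"Critique:\n{critique}\n\nText:\n{text}"
        )
    return text  # best effort after max_iters
```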
VecGlypher: Unified Vector Glyph Generation with Language Models

Xiaoke Huang, Bhavul Gauri, Kam Woh Ng, Tony Ng, Mengmeng Xu, Zhiheng Liu · Feb 25, 2026 · Citations: 0

Automatic Metrics General
  • On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches state-of-the-art performance.
Provably Safe Generative Sampling with Constricting Barrier Functions

Darshan Gadginmath, Ahmed Allibhoy, Fabio Pasqualetti · Feb 24, 2026 · Citations: 0

Automatic Metrics General
  • However, a critical gap remains for their deployment in safety-critical domains: the lack of formal guarantees that generated samples will satisfy hard constraints.
  • We address this by proposing a safety filtering framework that acts as an online shield for any pre-trained generative model.
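
As a simplified illustration of the "online shield" pattern described above, the sketch below filters samples from a pre-trained generator against a hard constraint h(x) >= 0. The sampler, the constraint, and the rejection-style filtering are assumptions for illustration; they stand in for, and are much simpler than, the paper's constricting-barrier-function machinery.

```python
import numpy as np

def pretrained_sampler(n: int, rng: np.random.Generator) -> np.ndarray:
    """Stand-in for any pre-trained generative model."""
    return rng.normal(size=(n, 2))

def h(x: np.ndarray) -> np.ndarray:
    """Example hard constraint: stay inside the unit disk (h >= 0 is safe)."""
    return 1.0 - np.linalg.norm(x, axis=-1)

def shielded_sample(n: int, seed: int = 0) -> np.ndarray:
    """Online shield: only samples satisfying the constraint are returned."""
    rng = np.random.default_rng(seed)
    accepted = []
    while sum(len(a) for a in accepted) < n:
        batch = pretrained_sampler(4 * n, rng)
        accepted.append(batch[h(batch) >= 0.0])  # keep only constraint-satisfying samples
    return np.concatenate(accepted)[:n]

samples = shielded_sample(100)
assert np.all(h(samples) >= 0.0)  # every returned sample satisfies the hard constraint
```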
The Headless Firm: How AI Reshapes Enterprise Boundaries

Tassilo Klein, Sebastian Wieczorek · Feb 24, 2026 · Citations: 0

Automatic Metrics General
  • We argue that agentic AI induces a structural change in how coordination costs scale: in prior modular systems, integration cost grew with interaction topology (O(n^2) in the number of components); in protocol-mediated agentic systems, this scaling changes.
  • This shift selects for a specific organizational equilibrium -- the Headless Firm -- structured as an hourglass: a personalized generative interface at the top, a standardized protocol waist in the middle, and a competitive market of micro-
Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li, Alfy Samuel · Feb 24, 2026 · Citations: 0

Pairwise Preference Red Team Automatic Metrics General
  • Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs).
  • We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales.
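
The summary does not give the weighting scheme, so the sketch below shows the standard DPO loss with a per-example weight hook. The DPO term itself is standard; the weights tensor is only an assumed stand-in for whatever alignment weighting the paper defines.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    weights: torch.Tensor,                # per-example alignment weights (assumed hook)
    beta: float = 0.1,
) -> torch.Tensor:
    # Standard DPO logits: beta * (chosen log-ratio minus rejected log-ratio).
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    per_example = -F.logsigmoid(logits)       # standard DPO loss per preference pair
    return (weights * per_example).mean()     # e.g., weight safety-critical pairs more
```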
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi · Feb 24, 2026 · Citations: 0

Automatic Metrics General
  • Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidates (a minimal sketch of this step follows this entry).
  • We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment.
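
A minimal sketch of the reflection-in-action step referenced above, assuming hypothetical call_llm and score_candidate helpers; retrospective reflection and the hindsight model updates are not shown.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in any LLM client

def score_candidate(observation: str, candidate: str) -> float:
    raise NotImplementedError  # e.g., an LLM critic or a learned value model

def reflect_in_action(observation: str, n_candidates: int = 5) -> str:
    # Test-time scaling: sample several candidate next actions...
    candidates = [
        call_llm(f"Propose one next action for:\n{observation}\n(variant {i})")
        for i in range(n_candidates)
    ]
    # ...score each one, and execute the highest-scoring candidate.
    scores = [score_candidate(observation, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```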