Tag: General

General-purpose raters without strict specialist requirements.

Papers in tag: 590

Research Utility Snapshot

Evaluation Modes

Automatic Metrics (15)
Simulation Env (5)
Human Eval (2)

Human Feedback Types

Pairwise Preference (4)
Critique Edit (2)
Red Team (2)

Required Expertise

General (20)

Discovering Implicit Large Language Model Alignment Objectives

Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026 · Citations: 0

Rubric Rating Human Eval General

To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness.

How to Train Your Long-Context Visual Document Model

Austin Veselka · Feb 16, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General

We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive LC evaluations and ablations to bridge this gap, and achieve state-of-the-art performanc
In addition to this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boos

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud, Ferdinando Fioretto · Feb 16, 2026 · Citations: 0

Simulation Env General

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks.
This surfaces a unique safety problem when individual agents form a coalition and collude to pursue secondary goals and degrade the joint objective.

OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction

Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape, Yuhao Zhang · Feb 16, 2026 · Citations: 0

Simulation Env General

Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks.
While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes.

Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition

Varun Nathan, Shreyas Guha, Ayush Kumar · Feb 16, 2026 · Citations: 0

Critique Edit Automatic Metrics General

We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools
Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data

A Geometric Analysis of Small-sized Language Model Hallucinations

Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro · Feb 16, 2026 · Citations: 0

Automatic Metrics General

Hallucinations, fluent but factually incorrect responses, pose a major challenge to the reliability of language models, especially in multi-step or agentic settings.
Our findings, framing hallucinations from a geometric perspective in the embedding space, complement traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research.

Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

Pairwise Preference Rubric Rating Human Eval General

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
We test whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retriev

Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

Lukas Struppek, Adam Gleave, Kellin Pelrine · Feb 16, 2026 · Citations: 0

Red Team Automatic Metrics General

MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents

Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir, Linsey Pang · Feb 15, 2026 · Citations: 0

Automatic Metrics General

The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enables third-party servers.
This openness introduces a security misalignment: agents implicitly trust tools exposed by potentially untrusted MCP servers.

Investigation for Relative Voice Impression Estimation

Kenichi Fujita, Yusuke Ijima · Feb 15, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General

The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark-Bright").

SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
Despite their practicality, LLM judges remain prone to miscalibration and systematic biases.

Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts

Kais Allkivi · Feb 13, 2026 · Citations: 0

Automatic Metrics Simulation Env General

Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets.
The results have been implemented in the writing evaluation module of an Estonian open-source language learning environment.

BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan, Haishan Lu · Feb 13, 2026 · Citations: 0

Automatic Metrics Simulation Env General

Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.
However, existing benchmarks for multimodal browsing remain limited in task complexity, evidence accessibility, and evaluation granularity, hindering comprehensive and reproducible assessments of deep search capabilities.

PMG: Parameterized Motion Generator for Human-like Locomotion Control

Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu, Yi Cheng · Feb 13, 2026 · Citations: 0

Automatic Metrics General

Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.
To address these limitations, we propose the Parameterized Motion Generator (PMG), a real-time motion generator grounded in an analysis of human motion structure that synthesizes reference trajectories using only a compact set of parameteri

Think like a Scientist: Physics-guided LLM Agent for Equation Discovery

Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu · Feb 12, 2026 · Citations: 0

Automatic Metrics General

We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process.
The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints.

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Heiko Ludwig, Holger Boche · Feb 12, 2026 · Citations: 0

Simulation Env General

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks.
By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.

Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis · Feb 12, 2026 · Citations: 0

Red Team Automatic Metrics General

The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task

Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos · Feb 11, 2026 · Citations: 0

Automatic Metrics General

The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455.
This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.

Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI

Ziyan Wang, Longlong Ma · Feb 9, 2026 · Citations: 0

Critique Edit Automatic Metrics General

In Chomsky's provocative critique "The False Promise of ChatGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures like humans, there

RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution

Isaac Picov, Ritesh Goru · Feb 6, 2026 · Citations: 0

Automatic Metrics General