
Tag: General

General-purpose raters without strict specialist requirements.

Papers in tag: 590

Research Utility Snapshot

Evaluation Modes

  • Automatic Metrics (18)
  • Simulation Env (2)
  • Human Eval (1)

Human Feedback Types

  • Pairwise Preference (6)
  • Demonstrations (2)
  • Red Team (2)

Required Expertise

  • General (20)
FENCE: A Financial and Multimodal Jailbreak Detection Dataset

Mirae Kim, Seonghun Jeong, Youngjun Kwak · Feb 20, 2026 · Citations: 0

Red Team Automatic Metrics General
  • A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models.
Mind the Style: Impact of Communication Style on Human-Chatbot Interaction

Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver · Feb 19, 2026 · Citations: 0

Automatic Metrics General
  • Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear.
  • These findings highlight the importance of user- and task-sensitive conversational agents and suggest that communication style personalization can meaningfully enhance interaction quality and performance.
Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld · Feb 19, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General
  • Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order.
  • Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments.
Modeling Distinct Human Interaction in Web Agents

Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu · Feb 19, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General
  • Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold.
  • However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation.
KLong: Training LLM Agent for Extremely Long-horizon Tasks

Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi · Feb 19, 2026 · Citations: 0

Rubric Rating Automatic Metrics General
  • This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks.
  • Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe.
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability

Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026 · Citations: 0

Automatic Metrics General
  • In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning with each other in the form of Chain-of-Thought (CoT) traces.
  • Current CoT evaluation narrowly focuses on target task accuracy.
Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Kensuke Okada, Yui Furukawa, Kyosuke Bunji · Feb 19, 2026 · Citations: 0

Rubric Rating Automatic Metrics General
  • Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments.
  • We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs.
The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI

Dusan Bosnjakovic · Feb 19, 2026 · Citations: 0

Automatic Metrics General
  • As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures
  • Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies -- the "prevailing mindsets" embedded during training and alignment that outlive individual model versions.
Large Language Models Persuade Without Planning Theory of Mind

Jared Moore, Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber, Cameron R. Jones · Feb 19, 2026 · Citations: 0

Automatic Metrics General
  • A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks.
  • We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information.
Claim Automation using Large Language Model

Zhengda Mo, Zhiyu Quan, Eli O'Donohue, Kaiwen Zhong · Feb 18, 2026 · Citations: 0

Human Eval Automatic Metrics General
  • We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy.
Who can we trust? LLM-as-a-jury for Comparative Assessment

Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General
  • Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements.
  • Existing approaches typically rely on single judges, or aggregate multiple judges under an assumption of equal reliability.
Creating a digital poet

Vered Tohar, Tsahi Hayat, Amir Leshem · Feb 18, 2026 · Citations: 0

Automatic Metrics General
  • In a blinded authorship test with 50 humanities students and graduates (three AI poems and three poems by well-known poets each), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence
TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers

Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif, Segev Shlomov · Feb 18, 2026 · Citations: 0

Automatic Metrics General
  • Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating,
  • We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces.
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin · Feb 18, 2026 · Citations: 0

Pairwise Preference Simulation Env General
  • Existing evaluations of agents with memory typically assess memorization and action in isolation.
  • One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions.
Learning Personalized Agents from Human Feedback

Kaiqu Liang, Julia Kruk, Shengyi Qian, Xianjun Yang, Shengjie Bi, Yuanshun Yao · Feb 18, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General
  • Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users.
  • Prior approaches typically rely on static datasets, either training implicit preference models on interaction history or encoding user profiles in external memory.
Intent Laundering: AI Safety Datasets Are Not What They Seem

Shahriar Golchin, Marc Wetter · Feb 17, 2026 · Citations: 0

Red Team Automatic Metrics General
  • We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice.
  • We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks.
Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite

Tim Fischer, Chris Biemann · Feb 17, 2026 · Citations: 0

Demonstrations Automatic Metrics General
  • This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
  • Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities.
In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations

Mohammad Aflah Khan, Mahsa Amani, Soumi Das, Bishwamittra Ghosh, Qinyuan Wu, Krishna P. Gummadi · Feb 17, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General
  • Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms.
  • These agents filter, prioritize, and synthesize information retrieved from the platforms' back-end databases or via web search.
World-Model-Augmented Web Agents with Action Correction

Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li, Shengyu Zhang · Feb 17, 2026 · Citations: 0

LLM As Judge Simulation Env General
  • Web agents based on large language models have demonstrated promising capability in automating web tasks.
  • However, current web agents struggle to reason out sensible actions due to the limitations of predicting environment changes, and might not possess comprehensive awareness of execution risks, prematurely performing risky actions that cause
Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework

Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang, Li Qing · Feb 17, 2026 · Citations: 0

Demonstrations Automatic Metrics General
  • Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
  • We first define the components and evaluation metrics for TOFs, then formalize a cost-efficient flowchart construction algorithm to abstract procedural knowledge from service dialogues.