Tag: General

General-purpose raters without strict specialist requirements.

Papers in tag: 585

Tag RSS Global RSS

Research Utility Snapshot

Evaluation Modes

Automatic Metrics (16)
Human Eval (4)
Simulation Env (3)

Human Feedback Types

Pairwise Preference (5)
Critique Edit (1)
Expert Verification (1)

Required Expertise

General (20)

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang, Longtao Huang · Feb 24, 2026 · Citations: 0

Automatic Metrics General

Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution.
Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows.

CAMEL: Confidence-Gated Reflection for Reward Modeling

Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu · Feb 24, 2026 · Citations: 0

Pairwise PreferenceCritique Edit Automatic Metrics General

Reward models play a fundamental role in aligning large language models with human preferences.
Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational ove

CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models

Anqi Li, Chenxiao Wang, Yu Lu, Renjun Xu, Lizhi Ma, Zhenzhong Lan · Feb 24, 2026 · Citations: 0

Human EvalAutomatic Metrics General

Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings.
CARE also produces high-quality, contextually grounded rationales, validated by both automatic and human evaluations.

Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books

W. Frederick Zimmerman · Feb 24, 2026 · Citations: 0

Automatic Metrics General

Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy, Kilian Q. Weinberger · Feb 24, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics General

Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves $>70\%$ win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning.

PreScience: A Benchmark for Forecasting Scientific Contributions

Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski, Oyvind Tafjord · Feb 24, 2026 · Citations: 0

Human EvalSimulation Env General

We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction.
We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement.

Contextual Safety Reasoning and Grounding for Open-World Robots

Zachary Ravichandran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar, George J. Pappas · Feb 23, 2026 · Citations: 0

Simulation Env General

Traditional safety approaches enforce fixed constraints in user-specified contexts, limiting their ability to handle the open-ended contextual variability of real-world deployment.
We address this gap via CORE, a safety framework that enables online contextual reasoning, grounding, and enforcement without prior knowledge of the environment (e.g., maps or safety specifications).

Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation

Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong, Chuan Shi · Feb 23, 2026 · Citations: 0

Expert Verification Automatic Metrics General

Additionally, we present \textbf{HyperDocRED}, a rigorously annotated benchmark for document-level knowledge hypergraph extraction.

Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

Jasmin Han, Janardan Devkota, Joseph Waring, Amanda Luken, Felix Naughton, Roger Vilardaga · Feb 23, 2026 · Citations: 0

Rubric Rating Automatic Metrics General

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao · Feb 22, 2026 · Citations: 0

Automatic Metrics General

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across

Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model.

VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Diogo Glória-Silva, David Semedo, João Maglhães · Feb 22, 2026 · Citations: 0

Automatic Metrics General

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90\% accuracy on plan-aware VQA.

Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks

Wilson Y. Lee · Feb 22, 2026 · Citations: 0

Automatic Metrics General

Why do language agents fail on tasks they are capable of solving?
Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs) and agent success depends critically on whether a trajectory stays within this path's operating en

Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026 · Citations: 0

Pairwise Preference Human Eval General

One annotator pair achieved almost perfect agreement ($κ= 0.8743$; $93.8\%$ raw agreement), exceeding a number of reported benchmarks for English sarcasm research works.

MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li · Feb 21, 2026 · Citations: 0

Red Team Automatic Metrics General

Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem

Lichang Song, Ting Long, Yi Chang · Feb 21, 2026 · Citations: 0

Automatic Metrics General

To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-ma

Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift

Stephen Russell · Feb 21, 2026 · Citations: 0

Automatic Metrics General

Validating Political Position Predictions of Arguments

Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026 · Citations: 0

Pairwise Preference Human Eval General

Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation.

Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation

Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026 · Citations: 0

Automatic MetricsSimulation Env General

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
By prioritizing deterministic rules, clear decision tracking, and retaining unresolved cases for human review, the framework provides a practical foundation for downstream manufacturing automation in real-world industrial environments.

Simplifying Outcomes of Language Model Component Analyses with ELIA

Aaron Louis Eidt, Nils Feldhus · Feb 20, 2026 · Citations: 0

Pairwise Preference Automatic Metrics General

The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations.