Research Utility Snapshot
Evaluation Modes
- Automatic Metrics (16)
- Simulation Env (2)
- Human Eval (1)
Human Feedback Types
- Critique Edit (19)
- Pairwise Preference (3)
- Expert Verification (1)
Required Expertise
- General (14)
- Coding (3)
- Law (1)
Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu · Feb 25, 2026 · Citations: 0
Critique Edit Automatic Metrics General
- To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer.
- Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%.
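The iterative critique-and-rewrite loop described above can be sketched as follows. All function names and the toy sensitivity list are hypothetical stand-ins for the paper's LLM-based Editor, not the SemSIEdit API:

```python
def find_sensitive_spans(text):
    # Stand-in critic: flag any span matching a toy sensitivity list.
    return [s for s in ["alice@example.com", "42 Baker St"] if s in text]

def rewrite_span(text, span):
    # Stand-in editor: replace the span with a neutral placeholder so the
    # surrounding narrative still reads coherently instead of refusing.
    return text.replace(span, "[redacted detail]")

def agentic_edit(text, max_rounds=3):
    """Iterate critique -> rewrite until no sensitive spans remain."""
    for _ in range(max_rounds):
        spans = find_sensitive_spans(text)
        if not spans:
            break
        for span in spans:
            text = rewrite_span(text, span)
    return text

doc = "Contact alice@example.com at 42 Baker St for details."
print(agentic_edit(doc))
# → Contact [redacted detail] at [redacted detail] for details.
```

The fixed round cap mirrors the inference-time nature of the framework: the loop terminates even if the critic keeps flagging spans.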
CAMEL: Confidence-Gated Reflection for Reward Modeling Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu · Feb 24, 2026 · Citations: 0
Pairwise Preference Critique Edit Automatic Metrics General
- Reward models play a fundamental role in aligning large language models with human preferences.
- Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead.
Can Large Language Models Replace Human Coders? Introducing ContentBench Michael Haman · Feb 23, 2026 · Citations: 0
Critique Edit Automatic Metrics Coding
- This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.
- The suite uses versioned tracks that invite researchers to contribute new benchmark datasets.
Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition Varun Nathan, Shreyas Guha, Ayush Kumar · Feb 16, 2026 · Citations: 0
Critique Edit Automatic Metrics General
- We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools.
- Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a dataset …
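A minimal sketch of the metric-wise evaluation mode described above: score a candidate plan along several dimensions and aggregate. Only the two quoted dimension names come from the abstract; the third and the uniform-mean aggregation are illustrative assumptions:

```python
def evaluate_plan(scores: dict) -> float:
    """Aggregate per-dimension scores (each in [0, 1]) into one plan score."""
    if not scores:
        raise ValueError("no dimension scores provided")
    # Unweighted mean is an assumption; the paper's aggregation may differ.
    return sum(scores.values()) / len(scores)

dims = {
    "tool_prompt_alignment": 0.9,  # quoted in the abstract
    "query_adherence": 0.8,        # quoted in the abstract
    "step_ordering": 0.7,          # illustrative placeholder dimension
}
print(round(evaluate_plan(dims), 3))
```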
Unlocking Reasoning Capability on Machine Translation in Large Language Models Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio, Tom Kocmi · Feb 16, 2026 · Citations: 0
Critique Edit Automatic Metrics Math Multilingual
- We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xiang Xu · Feb 15, 2026 · Citations: 0
Expert Verification Critique Edit Automatic Metrics Law
- Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
- However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons.
From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026 · Citations: 0
Critique Edit Simulation Env Coding
Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI Ziyan Wang, Longlong Ma · Feb 9, 2026 · Citations: 0
Critique Edit Automatic Metrics General
- In Chomsky's provocative critique "The False Promise of ChatGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures like humans.
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind Zhitao He, Zongwei Lyu, Yi R Fung · Jan 22, 2026 · Citations: 0
Pairwise Preference Critique Edit Human Eval General
- In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion strategies, and generates responses.
- To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach.
VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding Haorui Yu, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0
Critique Edit Automatic Metrics General
- We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception.
- Existing VLM benchmarks predominantly measure L1-L2 capabilities (object recognition, scene description, and factual question answering) while under-evaluating higher-order cultural interpretation.
Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0
Rubric Rating Critique Edit Automatic Metrics Coding
- Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
- We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2).
Reward Modeling from Natural Language Human Feedback Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang, Shaoning Sun · Jan 12, 2026 · Citations: 0
Pairwise Preference Critique Edit Automatic Metrics General
- Reinforcement Learning with Verifiable Rewards (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs).
- Typically in pairwise rewarding tasks, GRMs generate reasoning chains ending with critiques and preference labels, and RLVR then relies on the correctness of the preference labels as the training reward.
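The training-reward scheme described above reduces to a simple verifiable check: the GRM emits a critique plus a preference label, and only the label's correctness is rewarded. A hedged sketch, with all names illustrative rather than taken from the paper:

```python
def rlvr_reward(predicted_label: str, gold_label: str) -> float:
    """Verifiable reward: 1.0 iff the predicted pairwise preference matches
    the gold preference. The critique text itself is not directly rewarded
    under this scheme, which is the limitation the paper targets."""
    return 1.0 if predicted_label == gold_label else 0.0

print(rlvr_reward("A", "A"), rlvr_reward("B", "A"))
# → 1.0 0.0
```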
Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang, Xinyuan Luo · Nov 14, 2025 · Citations: 0
Critique Edit Simulation Env General
- Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assistance.
Error-Aware Knowledge Distillation via Targeted Revision for Customer-Service Summarization Hee-Jin Lee, Zhen Guo, Luchao Jin, Morteza Moazami Goudarzi · Nov 4, 2025 · Citations: 0
Critique Edit Automatic Metrics General
SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin · Oct 6, 2025 · Citations: 0
Critique Edit Automatic Metrics General
- Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control.
- Our evaluations reveal several shortcomings: open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism and propaganda.
ClearFairy: Capturing Creative Workflows through Decision Structuring, In-Situ Questioning, and Rationale Inference Kihoon Son, DaEun Choi, Tae Soo Kim, Young-Ho Kim, Sangdoo Yun, Juho Kim · Sep 18, 2025 · Citations: 0
Critique Edit Automatic Metrics General
- Furthermore, exploratory applications demonstrate that captured steps can enhance generative AI agents in Figma, yielding predictions better aligned with professionals and producing coherent outcomes.
TASER: Table Agents for Schema-guided Extraction and Recommendation Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso · Aug 18, 2025 · Citations: 0
Critique Edit Automatic Metrics General
- To address this, we present TASER (Table Agents for Schema-guided Extraction and Recommendation), a continuously learning, agentic table extraction system that converts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs.
- Our Recommender Agent reviews unmatched outputs and proposes schema revisions, enabling TASER to outperform vision-based table detection models such as Table Transformer by 10.1%.
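The Recommender Agent's review step can be sketched as collecting extracted fields that match no schema column and surfacing them as candidate schema revisions. All names and data below are illustrative, not the TASER implementation:

```python
def propose_schema_revisions(extracted: dict, schema: set) -> list:
    """Return extracted field names absent from the current schema as
    candidate revisions for a human or downstream agent to approve."""
    return sorted(k for k in extracted if k not in schema)

schema = {"date", "amount"}
row = {"date": "2024-01-01", "amount": "10.50", "fund_name": "Alpha"}
print(propose_schema_revisions(row, schema))
# → ['fund_name']
```

Feeding approved revisions back into the schema is what would make such a system "continuously learning" in the abstract's sense.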
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang · Jun 3, 2025 · Citations: 0
Critique Edit Automatic Metrics General
SEFL: A Framework for Generating Synthetic Educational Assignment Feedback with LLM Agents Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva · Feb 18, 2025 · Citations: 0
Critique Edit Automatic Metrics General
- Through comprehensive evaluations with three LLM judges and three human experts, across a subset of 900 outputs, we demonstrate that SEFL-tuned models outperform both their untuned counterparts and an existing baseline in terms of feedback quality.
- The potential for societal impact is reinforced by extensive qualitative comments and ratings from human stakeholders -- both students and higher education instructors.