
Tag: Critique Edit

Filtered HFEPX paper feed.

Papers in tag: 63


Research Utility Snapshot

Evaluation Modes

  • Automatic Metrics (5)
  • LLM-as-Judge (2)
  • Simulation Env (1)

Human Feedback Types

  • Critique Edit (20)
  • Rubric Rating (3)
  • RLAIF or Synthetic Feedback (2)

Required Expertise

  • General (9)
  • Coding (6)
  • Math (3)

How Much LLM Does a Self-Revising Agent Actually Need?

Sungwoo Jung, Seonil Son · Apr 8, 2026 · Citations: 0

Critique Edit Automatic Metrics General
  • Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop.
  • We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure.

The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management

Andrew Ang, Nazym Azimbayev, Andrey Kim · Apr 2, 2026 · Citations: 0

Critique Edit Coding
  • Agentic AI shifts the investor's role from analytical execution to oversight.
  • We present an agentic strategic asset allocation pipeline in which approximately 50 specialized agents produce capital market assumptions, construct portfolios using over 20 competing methods, and critique and vote on each other's output.

When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang · Apr 1, 2026 · Citations: 0

Critique Edit Simulation Env Coding
  • As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution…
  • In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes.

Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang, Haokai Xu · Apr 1, 2026 · Citations: 0

Rubric Rating Critique Edit Law
  • However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to…

Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study

Zaifu Zhan, Mengyuan Cui, Rui Zhang · Mar 31, 2026 · Citations: 0

Critique Edit Automatic Metrics Medicine
  • Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile,…
  • In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track…

The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle

Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen · Mar 30, 2026 · Citations: 0

Critique Edit Coding
  • Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin.
  • The `AIGENIE` R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early…

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang · Mar 30, 2026 · Citations: 0

Critique Edit General
  • We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe.
  • On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness,…

RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models

Rahul Soni · Mar 27, 2026 · Citations: 0

Critique Edit Math
  • Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks.
  • To address this limitation, we introduce Retrieval-Augmented Self-Supervised Prompt Refinement (RASPRef), a framework that improves prompts without requiring human annotations or task-specific supervision.

BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

Praveen Kumar Myakala, Manan Agrawal, Rahul Manche · Mar 25, 2026 · Citations: 0

Pairwise Preference Critique Edit Automatic Metrics General
  • LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved.
  • We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).

EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning

Andreas Sauter, Yuyue Zhao, Jacopo Urbani, Wenxiang Hu, Zaiqiao Meng, Lun Zhou · Mar 23, 2026 · Citations: 0

Rubric Rating Critique Edit LLM-as-Judge General
  • EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, and (2) fine-grained language feedback that offers span-level critiques regarding grounding,…

Adaptive Robust Estimator for Multi-Agent Reinforcement Learning

Zhongyi Li, Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang, Hui Yang · Mar 23, 2026 · Citations: 0

Critique Edit Math
  • Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs generation, critique, and revision, making credit…
  • To address both issues, we propose a robust multi-agent reinforcement learning framework for collaborative reasoning, consisting of two components: Dual-Agent Answer-Critique-Rewrite (DACR) and an Adaptive Robust Estimator (ARE).

ReasonScaffold: A Scaffolded Reasoning-based Annotation Protocol for Human-AI Co-Annotation

Smitha Muthya Sudheendra, Jaideep Srivastava · Mar 22, 2026 · Citations: 0

Critique Edit Automatic Metrics General
  • We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior.
  • To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning.
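As described in the entry above, the Annotator Effort Proxy reduces to a simple proportion of changed labels. A minimal sketch of that computation, with the function name and example labels being illustrative assumptions rather than the paper's implementation:

```python
def annotator_effort_proxy(initial_labels, final_labels):
    """AEP: proportion of labels revised after annotators
    were exposed to the scaffolded reasoning."""
    if len(initial_labels) != len(final_labels):
        raise ValueError("label lists must align one-to-one")
    revised = sum(a != b for a, b in zip(initial_labels, final_labels))
    return revised / len(initial_labels)

# Hypothetical example: 2 of 5 sentiment labels change after reasoning.
before = ["pos", "neg", "neg", "pos", "neu"]
after = ["pos", "pos", "neg", "neg", "neu"]
print(annotator_effort_proxy(before, after))  # 0.4
```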

PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs

Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang · Mar 21, 2026 · Citations: 0

Critique Edit Automatic Metrics Math
  • In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark.

XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. Fung · Mar 12, 2026 · Citations: 0

Critique Edit General
  • Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings.
  • To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents.

Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

Mingyang Song, Mao Zheng, Chenning Xu · Mar 11, 2026 · Citations: 0

Rubric Rating Critique Edit LLM-as-Judge General
  • Through a large-scale study of 105,600 evaluation instances (32 LLMs × 3 frontier judges × 100 tasks × 11 temperatures), we show that model-level agreement (Spearman ρ = 0.99) masks fragile sample-level agreement (Pearson r =…
  • Second, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment.
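The gap highlighted in the first bullet — near-perfect agreement between judges on model-level averages alongside weak per-sample agreement — is a general aggregation effect that can be reproduced on synthetic scores. The sketch below uses made-up data and plain NumPy, not the paper's benchmark or judges:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_samples = 8, 200

# Widely spaced model-level quality offsets shared by both judges.
quality = np.linspace(0, 7, n_models)[:, None]

# Each judge scores every sample with independent per-sample noise.
judge_a = quality + rng.normal(0, 1.0, (n_models, n_samples))
judge_b = quality + rng.normal(0, 1.0, (n_models, n_samples))

def spearman(x, y):
    """Spearman rank correlation via Pearson correlation of ranks."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Model-level: averaging over samples washes out the noise.
model_level = spearman(judge_a.mean(axis=1), judge_b.mean(axis=1))

# Sample-level within one model: only independent noise remains.
sample_level = np.corrcoef(judge_a[0], judge_b[0])[0, 1]

print(f"model-level Spearman:  {model_level:.2f}")   # near 1.0
print(f"sample-level Pearson:  {sample_level:.2f}")  # near 0.0
```

Averaging 200 noisy scores per model recovers the quality ordering almost exactly, while individual samples from the same model carry no shared signal — the "illusion of consensus" in miniature.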