Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop.
We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure.
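As a rough illustration of what such an externalized structure could look like, the Python sketch below models agent state with per-belief confidence signals and guard predicates on actions. The field names and schema (`confidence`, `guarded_actions`, `hypothetical_transitions`) are assumptions for illustration, not the paper's actual protocol.

```python
from dataclasses import dataclass, field

# Illustrative only: the paper's actual runtime schema is not specified here.
@dataclass
class ReflectiveRuntimeState:
    beliefs: dict = field(default_factory=dict)        # externalized world-model state
    confidence: dict = field(default_factory=dict)     # per-belief confidence in [0, 1]
    guarded_actions: list = field(default_factory=list)  # (action, guard_predicate) pairs
    hypothetical_transitions: list = field(default_factory=list)  # simulated (state, action, next_state)

    def admissible_actions(self):
        # An action is executable only if its guard holds on the current beliefs.
        return [a for a, guard in self.guarded_actions if guard(self.beliefs)]

state = ReflectiveRuntimeState(
    beliefs={"door_open": False},
    confidence={"door_open": 0.9},
    guarded_actions=[("walk_through", lambda b: b.get("door_open", False))],
)
print(state.admissible_actions())  # [] -- the guard blocks the action
```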
While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy.
This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors.
Andrew Ang, Nazym Azimbayev, Andrey Kim · Apr 2, 2026 · Citations: 0
Critique Edit · Coding
Agentic AI shifts the investor's role from analytical execution to oversight.
We present an agentic strategic asset allocation pipeline in which approximately 50 specialized agents produce capital market assumptions, construct portfolios using over 20 competing methods, and critique and vote on each other's output.
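A minimal sketch of how such cross-agent voting might be aggregated, assuming a simple plurality vote over competing methods; the paper's actual critique-and-vote mechanism is not specified here.

```python
from collections import Counter

# Hypothetical ballot: each agent votes for the portfolio-construction
# method it judges best after critiquing the others' output.
votes = {
    "agent_01": "risk_parity",
    "agent_02": "mean_variance",
    "agent_03": "risk_parity",
    "agent_04": "black_litterman",
    "agent_05": "risk_parity",
}

tally = Counter(votes.values())
winner, count = tally.most_common(1)[0]
print(f"selected method: {winner} ({count}/{len(votes)} votes)")
```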
As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution…
In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes.
However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to…
Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile,…
In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track…
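The sketch below shows the general shape of such an iterative self-reflection loop, with a stub in place of the GPT-4o call; the prompts and the stopping rule are assumptions, not the study's exact setup.

```python
def ask_model(prompt: str) -> str:
    # Stub standing in for a GPT-4o / GPT-4o-mini API call.
    return "OK" if "Reflect:" in prompt else "B"

def self_reflective_answer(question: str, options: list[str], max_rounds: int = 3) -> str:
    answer = ask_model(f"{question}\nOptions: {options}\nThink step by step, then answer.")
    for _ in range(max_rounds):
        critique = ask_model(
            f"Question: {question}\nProposed answer: {answer}\n"
            "Reflect: identify a flaw in the reasoning, or reply OK."
        )
        if critique.strip() == "OK":
            break  # the model accepts its own answer
        answer = ask_model(f"{question}\nRevise given this critique: {critique}")
    return answer

print(self_reflective_answer("Which option is correct?", ["A", "B", "C", "D"]))
```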
Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen · Mar 30, 2026 · Citations: 0
Critique Edit · Coding
Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin.
The `AIGENIE` R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early…
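As an illustrative Python analogue (the package itself is R), the sketch below prunes near-duplicate generated items with a simple lexical-similarity cutoff; AI-GENIE's actual network-psychometric evaluation is considerably more sophisticated, and the threshold here is an assumption.

```python
from itertools import combinations

candidate_items = [
    "I often feel anxious in social situations.",
    "Social situations often make me feel anxious.",
    "I enjoy meeting new people.",
]

def words(s: str) -> set[str]:
    return {w.strip(".,").lower() for w in s.split()}

def jaccard(a: str, b: str) -> float:
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

REDUNDANCY_THRESHOLD = 0.5  # assumed cutoff; AI-GENIE uses network psychometrics instead
kept = list(candidate_items)
for a, b in combinations(candidate_items, 2):
    if a in kept and b in kept and jaccard(a, b) > REDUNDANCY_THRESHOLD:
        kept.remove(b)  # drop the later near-duplicate item
print(kept)
```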
He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang · Mar 30, 2026 · Citations: 0
Critique Edit · General
We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe.
On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness,…
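A hedged sketch of the evaluation-driven evolutionary loop this describes, with stubs for kernel evaluation and LLM mutation. The feedback schema and selection rule are assumptions, and Kernel-Smith's archive also maintains diverse programs, which this greedy sketch omits.

```python
import random

def evaluate(kernel_src: str) -> dict:
    # Stand-in for compiling and benchmarking a GPU kernel: returns
    # structured feedback on compilation, correctness, and speed.
    return {"compiles": True, "correct": True, "gflops": random.uniform(1, 100)}

def mutate(parent_src: str, feedback: dict) -> str:
    # Stand-in for an LLM rewriting the kernel conditioned on feedback.
    return parent_src + f"\n// revision targeting {feedback['gflops']:.1f} GFLOP/s"

seed = "__global__ void k() {}"
archive = [(seed, evaluate(seed))]
for _ in range(10):
    parent, fb = max(archive, key=lambda kv: kv[1]["gflops"])  # exploit best-so-far
    child = mutate(parent, fb)
    child_fb = evaluate(child)
    if child_fb["compiles"] and child_fb["correct"]:
        archive.append((child, child_fb))

best_gflops = max(fb["gflops"] for _, fb in archive)
print(f"best of {len(archive)} candidates: {best_gflops:.1f} GFLOP/s")
```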
Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks.
To address this limitation, we introduce Retrieval-Augmented Self-Supervised Prompt Refinement (RASPRef), a framework that improves prompts without requiring human annotations or task-specific supervision.
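A minimal sketch of the retrieval-augmented refinement idea, assuming prompts are retrieved by a stored quality score and a candidate is accepted via a self-supervised proxy; every function body here is a stand-in, not RASPRef's actual method.

```python
corpus = {
    "Explain step by step.": 0.61,   # hypothetical past prompts with observed scores
    "Answer concisely.": 0.42,
    "List assumptions, then reason step by step.": 0.74,
}

def retrieve(prompt: str, k: int = 2) -> list[str]:
    # Stand-in retriever: return the k highest-scoring stored prompts.
    return sorted(corpus, key=corpus.get, reverse=True)[:k]

def refine(prompt: str) -> str:
    # Stand-in for an LLM merging the prompt with a retrieved exemplar.
    return prompt + " " + retrieve(prompt)[0]

def self_score(prompt: str) -> float:
    # Stand-in self-supervised signal (e.g., answer self-consistency).
    return len(set(prompt.split())) / len(prompt.split())

current = "Solve the problem."
candidate = refine(current)
if self_score(candidate) >= self_score(current):
    current = candidate  # keep the refinement only if the proxy does not degrade
print(current)
```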
LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved.
We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).
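The paper's formal definitions are not reproduced here; as one plausible reading, the sketch below computes a Contradiction Resolution Rate as the fraction of deliberately introduced contradictions the agent resolves. The event schema is an assumption.

```python
# Assumed reading of CRR: resolved contradictions / introduced contradictions.
events = [
    {"contradiction_introduced": True,  "resolved": True},
    {"contradiction_introduced": True,  "resolved": False},
    {"contradiction_introduced": True,  "resolved": True},
    {"contradiction_introduced": False, "resolved": False},
]

introduced = [e for e in events if e["contradiction_introduced"]]
crr = sum(e["resolved"] for e in introduced) / len(introduced)
print(f"CRR = {crr:.2f}")  # 2 of 3 contradictions resolved -> 0.67
```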
Andreas Sauter, Yuyue Zhao, Jacopo Urbani, Wenxiang Hu, Zaiqiao Meng, Lun Zhou · Mar 23, 2026 · Citations: 0
Rubric Rating · Critique Edit · LLM as Judge · General
EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, and (2) fine-grained language feedback that offers span-level critiques regarding grounding,…
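Lexicographic reward is a standard construction: candidates are compared on the highest-priority dimension first, with lower-priority dimensions breaking ties. A minimal sketch, with assumed dimension names rather than EvoIdeator's actual rubric:

```python
PRIORITY = ("grounding", "novelty", "clarity")  # highest priority first

def reward_key(scores: dict) -> tuple:
    # Tuples compare element-wise, which is exactly lexicographic order.
    return tuple(scores[d] for d in PRIORITY)

candidates = {
    "idea_a": {"grounding": 0.9, "novelty": 0.4, "clarity": 0.8},
    "idea_b": {"grounding": 0.9, "novelty": 0.7, "clarity": 0.2},
    "idea_c": {"grounding": 0.6, "novelty": 0.9, "clarity": 0.9},
}

best = max(candidates, key=lambda c: reward_key(candidates[c]))
print(best)  # idea_b: ties idea_a on grounding, wins on novelty
```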
Zhongyi Li, Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang, Hui Yang · Mar 23, 2026 · Citations: 0
Critique Edit · Math
Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs generation, critique, and revision, making credit…
To address both issues, we propose a robust multi-agent reinforcement learning framework for collaborative reasoning, consisting of two components: Dual-Agent Answer-Critique-Rewrite (DACR) and an Adaptive Robust Estimator (ARE).
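A minimal sketch of the answer-critique-rewrite interaction pattern, with stub agents; DACR's reinforcement-learning training signal and credit assignment are omitted, and the agent behaviors here are assumptions.

```python
def solver(problem: str, critique: str | None = None) -> str:
    # Stand-in for the answering agent (optionally conditioned on a critique).
    return "x = 4" if critique else "x = 5"

def critic(problem: str, answer: str) -> str | None:
    # Stand-in for the critiquing agent; None means the answer is accepted.
    return None if answer == "x = 4" else "Check the sign when moving terms."

problem = "Solve 2x - 3 = 5."
answer = solver(problem)
critique = critic(problem, answer)
if critique is not None:
    answer = solver(problem, critique)  # rewrite phase
print(answer)  # x = 4
```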
We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior.
To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning.
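Since AEP is defined as the proportion of labels revised after exposure to reasoning, it can be computed directly; the labels below are illustrative, not the study's data.

```python
labels_before = ["pos", "neg", "neg", "pos", "neutral"]
labels_after  = ["pos", "pos", "neg", "pos", "neg"]

# AEP = fraction of labels that changed after annotators saw model reasoning.
revised = sum(b != a for b, a in zip(labels_before, labels_after))
aep = revised / len(labels_before)
print(f"AEP = {aep:.2f}")  # 2 of 5 labels revised -> 0.40
```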
Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang · Mar 21, 2026 · Citations: 0
Critique Edit · Automatic Metrics · Math
In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark.
Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. Fung · Mar 12, 2026 · Citations: 0
Critique Edit · General
Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings.
To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents.
Mingyang Song, Mao Zheng, Chenning Xu · Mar 11, 2026 · Citations: 0
Rubric Rating · Critique Edit · LLM as Judge · General
Through a large-scale study of 105,600 evaluation instances (32 LLMs × 3 frontier judges × 100 tasks × 11 temperatures), we show that model-level agreement (Spearman ρ = 0.99) masks fragile sample-level agreement (Pearson r =…
Second, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment.