- When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou · Apr 1, 2026 · Citations: 0
Critique Edit Simulation Env Long Horizon
As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution…
- RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models
Rahul Soni · Mar 27, 2026 · Citations: 0
Critique Edit Long Horizon
Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks.
- ReasonScaffold: A Scaffolded Reasoning-based Annotation Protocol for Human-AI Co-Annotation
Smitha Muthya Sudheendra, Jaideep Srivastava · Mar 22, 2026 · Citations: 0
Critique Edit Automatic Metrics
We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior.
- PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs
Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang · Mar 21, 2026 · Citations: 0
Critique Edit Automatic Metrics
In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark.
- EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning
Andreas Sauter, Yuyue Zhao, Jacopo Urbani, Wenxiang Hu, Zaiqiao Meng · Mar 23, 2026 · Citations: 0
Rubric Rating Critique Edit Llm As Judge
EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, and (2) fine-grained language feedback that offers span-level critiques regarding grounding,…
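Lexicographic rewards rank candidates by a strict priority order of criteria: the first dimension decides, ties fall through to the next. A minimal generic sketch of this comparison follows; the dimension names are illustrative, not EvoIdeator's actual rubric.

```python
def lexicographic_key(scores: dict, priority: list) -> tuple:
    """Build a sort key that respects a strict priority order of criteria."""
    return tuple(scores[dim] for dim in priority)

# Hypothetical judge scores for three candidate ideas.
candidates = [
    {"grounding": 2, "novelty": 5},
    {"grounding": 3, "novelty": 1},
    {"grounding": 3, "novelty": 4},
]
priority = ["grounding", "novelty"]  # earlier dimensions strictly dominate later ones

best = max(candidates, key=lambda s: lexicographic_key(s, priority))
# grounding decides first (3 beats 2); novelty breaks the tie (4 beats 1)
```

Because tuple comparison in Python is already lexicographic, no custom comparator is needed; changing the `priority` list reorders which criterion dominates.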
- Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai · Mar 30, 2026 · Citations: 0
Critique Edit Long Horizon
We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe.
- How Much LLM Does a Self-Revising Agent Actually Need?
Sungwoo Jung, Seonil Son · Apr 8, 2026 · Citations: 0
Critique Edit Automatic Metrics
Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop.
- The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle
Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen · Mar 30, 2026 · Citations: 0
Critique Edit Tool Use
Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin.
- Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study
Zaifu Zhan, Mengyuan Cui, Rui Zhang · Mar 31, 2026 · Citations: 0
Critique Edit Automatic Metrics
Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile,…
- BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents
Praveen Kumar Myakala, Manan Agrawal, Rahul Manche · Mar 25, 2026 · Citations: 0
Pairwise Preference Critique Edit Automatic Metrics
LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved.
- Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation
Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang · Apr 1, 2026 · Citations: 0
Rubric Rating Critique Edit
However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to…
- Adaptive Robust Estimator for Multi-Agent Reinforcement Learning
Zhongyi Li, Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang · Mar 23, 2026 · Citations: 0
Critique Edit Multi Agent
Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs generation, critique, and revision, making credit…
- From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection
Hongxu Zhou · Apr 7, 2026 · Citations: 0
Critique Edit
While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy.
- The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management
Andrew Ang, Nazym Azimbayev, Andrey Kim · Apr 2, 2026 · Citations: 0
Critique Edit
Agentic AI shifts the investor's role from analytical execution to oversight.
- Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines
Jingjie Ning, Xueqi Li, Chengyu Yu · Apr 1, 2026 · Citations: 0
Critique Edit
We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming.
- EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces
Léane Jourdan, Julien Aubert-Béduchaud, Yannis Chupin, Marah Baccari, Florian Boudin · Mar 30, 2026 · Citations: 0
Critique Edit
This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing.
- Understanding Teacher Revisions of Large Language Model-Generated Feedback
Conrad Borchers, Luiz Rodrigues, Newarney Torrezão da Costa, Cleon Xavier, Rafael Ferreira Mello · Mar 29, 2026 · Citations: 0
Critique Edit Rlaif Or Synthetic Feedback
First, we find that teachers accept AI feedback without modification in about 80% of cases, while feedback that does get edited tends to be significantly longer and is subsequently shortened by teachers.
- How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence
Alex Anvi Eponon, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov · Mar 18, 2026 · Citations: 0
Critique Edit
The dominant paradigms of artificial intelligence were shaped by learning theories from psychology: behaviorism inspired reinforcement learning, cognitivism gave rise to deep learning and memory-augmented architectures, and constructivism…