Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 216
SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?

Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li · Feb 28, 2026

Citations: 0
Automatic Metrics Long Horizon General
  • Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool…
  • Evaluating state-of-the-art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% by skill saving and reuse.
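The reported efficiency gains come from saving a successful sub-workflow once and replaying it on later occurrences instead of re-planning each step. A minimal sketch of that idea, with entirely illustrative names (`SkillCache`, `record`, `replay` are not SkillCraft's API):

```python
# Hypothetical sketch of skill saving and reuse for a tool-using agent.
# All names and the example workflow are illustrative, not SkillCraft's API.

class SkillCache:
    """Stores named sequences of tool calls so recurring sub-workflows
    can be replayed instead of re-planned step by step."""

    def __init__(self):
        self._skills = {}

    def record(self, name, tool_calls):
        # tool_calls: list of (tool_name, args) tuples captured from a
        # successful trajectory.
        self._skills[name] = list(tool_calls)

    def has(self, name):
        return name in self._skills

    def replay(self, name, execute):
        # execute: callable that runs one tool call and returns its result.
        return [execute(tool, args) for tool, args in self._skills[name]]


def demo():
    cache = SkillCache()
    log = []

    def execute(tool, args):
        log.append((tool, args))
        return f"{tool}-ok"

    # First encounter: the agent plans once and records the sub-workflow.
    cache.record("fetch_and_parse", [("http_get", {"url": "https://example.com"}),
                                     ("parse_json", {})])
    # Later encounters: replay the saved skill, skipping re-planning tokens.
    results = cache.replay("fetch_and_parse", execute)
    return results, log
```

Token savings in this framing come from the replay path: the agent emits one skill reference rather than a fresh multi-step plan.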
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He · Feb 27, 2026

Citations: 0
Automatic Metrics Long Horizon General
  • The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking.
  • To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following.
LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering

Rafid Ishrak Jahan, Fahmid Shahriar Iqbal, Sagnik Ray Choudhury · Feb 27, 2026

Citations: 0
Pairwise Preference Rubric Rating General
  • We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA.
  • We propose nine rubrics for answer quality evaluation, and show that simple linear models based on these features perform comparably to state-of-the-art LLM evaluators.
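A "simple linear model based on these features" amounts to a weighted sum of per-answer rubric scores, with the higher-scoring answer preferred. A sketch under that assumption; the three rubric names and the weights below are illustrative stand-ins (the paper uses nine rubrics, and its fitted weights are not shown here):

```python
# Minimal sketch: a linear model over per-answer rubric scores used to
# pick the preferred answer in a pair. Rubric names and weights are
# illustrative assumptions, not the paper's fitted values.

RUBRICS = ["factuality", "completeness", "coherence"]  # paper uses nine
WEIGHTS = {"factuality": 0.5, "completeness": 0.3, "coherence": 0.2}

def linear_score(rubric_scores):
    """Weighted sum of rubric scores (each assumed to lie in [0, 1])."""
    return sum(WEIGHTS[r] * rubric_scores[r] for r in RUBRICS)

def prefer(answer_a, answer_b):
    """Return 'A' if answer_a scores strictly higher, else 'B'."""
    return "A" if linear_score(answer_a) > linear_score(answer_b) else "B"

a = {"factuality": 0.9, "completeness": 0.6, "coherence": 0.8}
b = {"factuality": 0.7, "completeness": 0.9, "coherence": 0.8}
```

In practice such weights would be fit on the pairwise annotations (e.g., by logistic regression on score differences), which is what makes the comparison against LLM evaluators meaningful.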
Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks

Kunihiro Miyazaki, Takanobu Kawahara, Stephen Roberts, Stefan Zohren · Feb 26, 2026

Citations: 0
Pairwise Preference Multi Agent General
  • While mainstream approaches deploy multi-agent systems mimicking analyst and manager roles, they often rely on abstract instructions that overlook the intricacies of real-world workflows, which can lead to degraded inference performance and…
  • Therefore, we propose a multi-agent LLM trading framework that explicitly decomposes investment analysis into fine-grained tasks, rather than providing coarse-grained instructions.
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu, Wei-Shi Zheng · Feb 26, 2026

Citations: 0
Demonstrations General
  • Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation.
  • Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without…
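The predict-then-select loop described in the first bullet is standard model predictive control with a risk criterion. A generic sketch of that loop; the toy 1-D dynamics and distance-based risk function below are illustrative stand-ins, not RaWMPC's learned components:

```python
# Generic sketch of risk-aware model predictive control: roll out candidate
# actions through a world model, score each predicted rollout's risk, and
# execute the lowest-risk action. The toy dynamics and risk function are
# placeholders for the paper's learned world model and risk evaluation.

def world_model(state, action, horizon=3):
    """Toy 1-D dynamics: predict a short trajectory by repeating the action."""
    traj = []
    for _ in range(horizon):
        state = state + action  # placeholder for learned dynamics
        traj.append(state)
    return traj

def risk(trajectory, obstacle=5.0, margin=1.0):
    """Risk grows as the predicted trajectory nears an obstacle position."""
    closest = min(abs(s - obstacle) for s in trajectory)
    return max(0.0, margin - closest)

def select_action(state, candidates):
    """Pick the candidate action whose predicted rollout is lowest-risk."""
    return min(candidates, key=lambda a: risk(world_model(state, a)))
```

The second bullet's distillation step then trains the action proposal network so that its candidates already score low under this kind of risk evaluation, avoiding the rollout cost at test time.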
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao · Feb 26, 2026

Citations: 0
Automatic Metrics Web Browsing General
  • Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories.
  • We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where…
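Token pruning in this setting reduces to scoring visual tokens by some saliency criterion and keeping only the top-k. A generic sketch of that primitive, with saliency scores given as inputs (the paper's spatio-temporal criteria are more involved than this):

```python
# Generic token-pruning sketch: keep only the top-k most salient visual
# tokens, preserving their original spatial order. The saliency scores are
# assumed given; the paper derives them from spatio-temporal structure.

def prune_tokens(tokens, saliency, keep):
    """Return the `keep` tokens with highest saliency, in original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore original (spatial) order
    return [tokens[i] for i in kept]
```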
Pairwise Preference General
  • Inspired by Humphrey's ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience Ns and an optional affect proxy reporting…
  • Across fixed-parameter ablations (ReCoN, Ipsundrum, Ipsundrum+affect), we operationalize Humphrey's qualiaphilia (preference for sensory experience for its own sake) as a familiarity-controlled scenic-over-dull route choice.
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus

Anna Van Elst, Kerrian Le Caillec, Igor Colin, Stephan Clémençon · Feb 26, 2026

Citations: 0
Pairwise Preference Multi Agent General
  • The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical…
  • However, in many modern applications (e.g., peer-to-peer networks, IoT, multi-agent systems), extending the ability to calculate consensus rankings with guarantees in a decentralized setting, i.e., when preference data is initially distributed across a communicating network, remains…
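Borda consensus is a natural fit for gossip because Borda scores are additive: each node can score its local ranking, and repeated pairwise averaging drives every node's estimate toward the network-wide mean score vector, from which the consensus ranking follows. A generic gossip-averaging sketch (not the paper's algorithm or its convergence analysis):

```python
# Sketch: Borda scores computed locally at each node, then averaged by
# random pairwise gossip. Each exchange preserves the per-node score sum,
# and estimates converge to the network-wide mean Borda scores. Generic
# illustration, not the paper's algorithm.

import random

def borda_scores(ranking, items):
    """Borda score: the item ranked r-th (0-indexed) of n gets n - 1 - r."""
    n = len(items)
    return {item: n - 1 - ranking.index(item) for item in items}

def gossip_average(local, rounds=2000, seed=0):
    """Each round, two random nodes replace their vectors by the average."""
    rng = random.Random(seed)
    nodes = [dict(v) for v in local]
    items = list(nodes[0])
    for _ in range(rounds):
        i, j = rng.sample(range(len(nodes)), 2)
        for it in items:
            avg = (nodes[i][it] + nodes[j][it]) / 2
            nodes[i][it] = nodes[j][it] = avg
    return nodes

items = ["a", "b", "c"]
rankings = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
local = [borda_scores(r, items) for r in rankings]
estimates = gossip_average(local)
consensus = sorted(items, key=lambda it: -estimates[0][it])
```

Copeland consensus needs pairwise win counts rather than rank positions, but the gossip-averaging step is the same shape: average locally held statistics until every node agrees.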
DeepPresenter: Environment-Grounded Reflection for Agentic Presentation Generation

Hao Zheng, Guozhao Mo, Xinru Yan, Qianhao Yuan, Wenkai Zhang, Xuanang Chen · Feb 26, 2026

Citations: 0
Automatic Metrics Long Horizon General
  • However, existing presentation agents often rely on predefined workflows and fixed templates.
  • To address this, we present DeepPresenter, an agentic framework that adapts to diverse user intents, enables effective feedback-driven refinement, and generalizes beyond a scripted pipeline.
Moral Preferences of LLMs Under Directed Contextual Influence

Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov · Feb 26, 2026

Citations: 0
Pairwise Preference General
  • Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences.
  • We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they…
Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang, Jianlin Zhang · Feb 26, 2026

Citations: 0
Simulation Env Long Horizon General
  • However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings.
  • Moreover, we also provide an effective reinforcement learning post-training strategy to further enhance the safety of the learned planner.
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu · Feb 26, 2026

Citations: 0
Automatic Metrics Long Horizon General
  • To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications.
  • To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval.
RLHFless: Serverless Computing for Efficient RLHF

Rui Wei, Hanfei Yu, Shubham Jain, Yogarajan Sivakumar, Devesh Tiwari, Jian Li · Feb 26, 2026

Citations: 0
Pairwise Preference Automatic Metrics General
  • Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences.
Same Words, Different Judgments: Modality Effects on Preference Alignment

Aaron Broukhim, Nadir Weibel, Eshin Jolly · Feb 26, 2026

Citations: 0
Pairwise Preference RLAIF or Synthetic Feedback Automatic Metrics General
  • Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored.
  • We present a controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts.
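One standard way to quantify whether text-based and audio-based judgments of identical content agree is a chance-corrected agreement statistic such as Cohen's kappa over the paired preference labels. A sketch assuming binary A/B labels per prompt (the paper's exact analysis may differ):

```python
# Sketch: agreement between text-based and audio-based preference labels
# on the same prompts, measured with Cohen's kappa. Assumes one label per
# prompt per modality; illustrative, not the paper's reported analysis.

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label lists.
    Undefined (division by zero) when expected agreement is exactly 1."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)
```

For example, text labels `["A", "A", "B", "B"]` against audio labels `["A", "A", "B", "A"]` give 75% raw agreement but kappa of only 0.5 once chance agreement is removed.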
