Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 2 · Search mode: keyword

Filter by tag

All · Automatic Metrics (630) · General (243) · Pairwise Preference (128) · Long Horizon (127) · Coding (96) · Simulation Env (86) · Multi Agent (62) · Medicine (40) · Expert Verification (39) · Llm As Judge (39) · Web Browsing (34) · Rubric Rating (32) · Demonstrations (31) · Red Team (30) · Human Eval (29) · Math (29)

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou · Feb 15, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 90% · Moderate protocol signal · Freshness: Hot · Status: Fallback

Simulation Env · Long Horizon · General

The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
(2) Unified Enhancement of Agent Capabilities: we use a unified thought-synthesis pipeline to enhance the model's reasoning capabilities, while placing particular emphasis on improving key agent abilities, including Tool/MCP use, memory and…

CoAct-1: Computer-using Multi-Agent System with Coding Actions

Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li · Aug 5, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% · High protocol signal · Freshness: Cold · Status: Fallback

Automatic Metrics · Long Horizon · Coding

In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action.
We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods.

Protocol Hubs

Expert Verification Papers (39) · CS.CL + Expert Verification Papers (30) · Rubric Rating Papers (29) · CS.CL + Rubric Rating Papers (27) · CS.AI + Expert Verification Papers (25) · Pairwise Preference Papers (114) · CS.CL + Pairwise Preference Papers (99) · Coding Papers (91) · CS.CL Human Feedback And Eval Papers (1,674) · CS.CL + Coding Papers (73) · Expert Verification Papers (Last 120 Days) (30) · Medicine + Expert Verification Papers (20) · Expert Verification Papers (Last 90 Days) (29) · Medicine Papers (40) · CS.AI + Medicine Papers (25) · Automatic Metrics + Expert Verification Papers (25)
