Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 1
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

Guozhao Mo, Wenliang Zhong, Jiawei Chen, Qianhao Yuan, Xuanang Chen, Yaojie Lu · Aug 3, 2025

Citations: 0
LLM-as-Judge · Tool Use · Medicine · Coding
  • Unfortunately, there is still a large gap between real-world MCP usage and current evaluations, which typically assume single-server settings and inject tools directly into the model's context, bypassing the challenges of large-scale…
  • We benchmark 12 state-of-the-art LLMs and observe a substantial performance gap: while Claude-Sonnet-4 reaches 78.95% task success, most models achieve only 30-50%.
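The gap the first bullet describes can be illustrated with a minimal sketch (all names and the keyword-overlap scorer are hypothetical, not from the paper): single-server evaluations hand the model every tool up front, while realistic multi-server MCP usage requires retrieving a few relevant tools from a large pool first.

```python
def inject_all(tools):
    """Single-server style: every tool goes straight into the model's context."""
    return tools

def retrieve_top_k(tools, query_keywords, k=3):
    """Large-scale style: score tools by naive keyword overlap, keep the top-k."""
    def score(tool):
        words = set(tool["description"].lower().split())
        return len(words & query_keywords)
    return sorted(tools, key=score, reverse=True)[:k]

# A toy multi-server tool pool (hypothetical tool names and descriptions).
tools = [
    {"name": "weather.lookup", "description": "get current weather for a city"},
    {"name": "fs.read", "description": "read a file from disk"},
    {"name": "calendar.add", "description": "add an event to a calendar"},
    {"name": "search.web", "description": "search the web for a query"},
]

print(len(inject_all(tools)))  # every tool enters the context, relevant or not
print([t["name"] for t in retrieve_top_k(tools, {"weather", "city"}, k=1)])
```

The retrieval step is where large-scale settings get hard: with thousands of tools across many servers, a scorer like this must surface the right tool before the model ever sees it, which is exactly the challenge the benchmark stresses.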
