Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 2 Search mode: hybrid RSS

Filter by tag

All Automatic Metrics (527) General (186) Long Horizon (106) Pairwise Preference (91) Coding (69) Simulation Env (67) Multi Agent (46) Medicine (35) Expert Verification (33) Llm As Judge (28) Human Eval (25) Web Browsing (25) Rubric Rating (24) Red Team (23) Critique Edit (22) Multilingual (21)

ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning

Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026

Citations: 0

Automatic Metrics Long Horizon Math

We introduce \ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability.

Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots

Dimitrios P. Panagoulias, Evangelia-Aikaterini Tsichrintzi, Georgios Savvidis, Evridiki Tsoureli-Nikita · Feb 26, 2026

Citations: 0

Expert Verification Automatic Metrics Medicine

Human-in-the-loop validation is essential in safety-critical clinical AI, yet the transition between initial model inference and expert correction is rarely analyzed as a structured signal.
Evaluation on 21 dermatological cases (21 complete AI physician pairs) em- ployed a four-level concordance framework comprising exact primary match rate (PMR), semantic similarity-adjusted rate (AMR), cross-category alignment, and…

Protocol Hubs

Expert Verification Papers (32) CS.CL + Expert Verification Papers (24) Pairwise Preference Papers (89) CS.CL + Pairwise Preference Papers (74) Coding Papers (69) CS.CL Human Feedback And Eval Papers (1,020) CS.AI + Expert Verification Papers (20) CS.AI Human Feedback And Eval Papers (794) Expert Verification Or Pairwise Preference Papers (118) Pairwise Preference Papers (Last 120 Days) (59) Pairwise Preference Papers (Last 90 Days) (58) Pairwise Preference Papers (Last 60 Days) (57) Long Horizon Papers (101) CS.AI + Pairwise Preference Papers (52) Expert Verification Or Rubric Rating Papers (50) CS.CL + Coding Papers (51)

Benchmark Hubs

WebArena Ecosystem Benchmark Papers (13)

Metric Hubs

Daily Archives

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote