OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 5

Who can we trust? LLM-as-a-jury for Comparative Assessment

Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026

Citations: 0

Pairwise Preference Automatic Metrics General

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements.
Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability.

Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026

Citations: 0

Rubric Rating Critique Edit Automatic Metrics Coding

Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2).

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026

Citations: 0

Expert Verification Automatic Metrics Multi Agent Coding

Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm.

PMG: Parameterized Motion Generator for Human-like Locomotion Control

Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu, Yi Cheng · Feb 13, 2026

Citations: 0

Automatic Metrics Long Horizon General

Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.
To address these limitations, we propose the Parameterized Motion Generator (PMG), a real-time motion generator grounded in an analysis of human motion structure that synthesizes reference trajectories using only a compact set of parameteri

MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents

Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir, Linsey Pang · Feb 15, 2026

Citations: 0

Automatic Metrics Tool Use General

The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enables third-party servers.
This openness introduces a security misalignment: agents implicitly trust tools exposed by potentially untrusted MCP servers.
