Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 13 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang · Apr 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics Law
  • Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics MathLaw
  • We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%).
Open paper
Sabiá-4 Technical Report

Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás, Marcos Piau · Mar 10, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Tool Use LawCoding
  • The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal…
  • We evaluate the models on six benchmark categories: conversational capabilities in Brazilian Portuguese, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities…
Open paper
\$OneMillion-Bench: How Far are Language Agents from Human Experts?

Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen · Mar 9, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Rubric Rating Automatic Metrics Tool Use Law
  • To this end, we introduce \OneMillion-Bench \OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios.
  • We adopt a rubric-based evaluation protocol scoring factual accuracy, logical coherence, practical feasibility, and professional compliance, focused on expert-level problems to ensure meaningful differentiation across agents.
Open paper
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xander Xu · Feb 15, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Expert VerificationCritique Edit Automatic Metrics Law
  • Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
  • However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons.
Open paper
LEXam: Benchmarking Legal Reasoning on 340 Law Exams

Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang · May 19, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 98% Moderate protocol signal Freshness: Cold Status: Ready
Llm As JudgeAutomatic Metrics Long Horizon Law
  • To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels.
  • Deploying an ensemble LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately, closely aligning with human expert assessments.
Open paper
Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Multi Agent MathLaw
  • In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem.
  • Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure.
Open paper
Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification

Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan, Hasan Mahmud · Mar 30, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Multi Agent LawCoding
  • We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation.
  • In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp).
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Multi Agent LawCoding
  • LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly…
  • We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a…
Open paper
Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval

Andrea Volpini, Elie Raad, Beatrice Gamba, David Riccitelli · Mar 11, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Web Browsing Law
  • In this paper, we investigate whether structured linked data, specifically Schema.org markup and dereferenceable entity pages served by a Linked Data Platform, can improve retrieval accuracy and answer quality in both standard and agentic…
  • Our results reveal that while JSON-LD markup alone provides only modest improvements, our enhanced entity page format, incorporating llms.txt-style agent instructions, breadcrumbs, and neural search capabilities, achieves substantial gains:…
Open paper
MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs

Abdessalam Bouchekif, Shahd Gaben, Samer Rashwani, Somaya Eltanbouly, Mutaz Al-Khatib, Heba Sbahi · Mar 8, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Law
  • To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline.
  • The MAWARITH dataset is publicly available at https://github.com/bouchekif/inheritance_evaluation.
Open paper

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 98% Moderate protocol signal Freshness: Cold Status: Fallback
Automatic Metrics Multi Agent LawCoding
  • We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a multi-agent retrieval framework for grounded legal question answering that decomposes queries into structured sub-problems, retrieves evidence…
  • We introduce LegalSearchQA, a 50-question benchmark across five legal domains whose answers depend on recent developments that post-date model training data.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.