
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 55 · Search mode: keyword · Ranking: eval-signal prioritized

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

Start Here: By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Match reason: Keyword overlap 3/3 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 88% High protocol signal Freshness: Cold Status: Ready
Pairwise Preference Automatic Metrics Multi Agent General
  • While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition.
  • Priming conversations with specific topics identifies synergistic teams that outperform random baselines on downstream benchmarks and achieve accuracy comparable to manually curated teams built from known model specializations.
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan · Apr 7, 2025

Citations: 0

Score: 88% High protocol signal Freshness: Cold Status: Ready
Red Team Automatic Metrics Math Coding
  • We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies…
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, J Ross Mitchell · Oct 31, 2025

Citations: 0

Score: 83% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative…
  • Second, to enhance model performance, we propose LIGHT-a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for…
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning

Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li · Jun 23, 2025

Citations: 0

Score: 83% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and…
Structure-Augmented Reasoning Generation

Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han · Jun 10, 2025

Citations: 0

Score: 83% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • Extensive experiments on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine demonstrate that SARG significantly outperforms state-of-the-art flat-context RAG baselines in both factual accuracy and reasoning…
Structured Legal Document Generation in India: A Model-Agnostic Wrapper Approach with VidhikDastaavej

Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya · Apr 4, 2025

Citations: 0

Score: 83% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics Law
  • Comprehensive evaluation, including lexical, semantic, LLM-based, and expert-driven assessments with inter-annotator agreement, shows that the wrapper substantially improves factual accuracy, coherence, and completeness compared to…
  • This work establishes both a new benchmark dataset and a generalizable generation framework, paving the way for future research in AI-assisted legal drafting.
Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025

Citations: 0

Score: 88% High protocol signal Freshness: Cold Status: Fallback
Automatic Metrics Long Horizon Coding
  • On the Episodic Memory Benchmark (EpBench) [huet_episodic_2025], comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG-based baselines by up to 20%.
  • More broadly, GSW offers a concrete blueprint for endowing LLMs with human-like episodic memory, paving the way for more capable agents that can reason over long horizons.
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang, Yanan Zheng · Mar 16, 2025

Citations: 0

Score: 88% High protocol signal Freshness: Cold Status: Fallback
Automatic Metrics Long Horizon Math Law
  • Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM.
  • Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets demonstrate HRM's strong generalization and robustness across a variety of reasoning tasks.
Automatic Essay Scoring and Feedback Generation in Basque Language Learning

Ekhi Azurmendi, Xabier Arregi, Oier Lopez de Lacalle · Dec 9, 2025

Citations: 0

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors.
  • This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics Medicine
  • Results demonstrated substantial improvements through RL over the baseline GPT-2 across multiple evaluation metrics: BLEU (0.0111), ROUGE-1 (0.1397), ROUGE-2 (0.0213), ROUGE-L (0.1317), and METEOR (0.0581).
  • LLM evaluation confirmed high contextual relevance and professionalism, while RL achieved 99.34% emotion accuracy compared to 66.96% for baseline GPT-2.

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics Multilingual
  • Existing Bengali topic modeling research lacks standardized evaluation frameworks with comprehensive baselines and diverse datasets, exploration of modern methodological approaches, and reproducible implementations, with only three…
  • Additionally, we introduce NCTBText, a diverse Bengali textbook-based dataset comprising 8,650 text documents, curated from eight subject areas, providing much-needed topical diversity beyond newspaper-centric Bengali corpora and serving as…
EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents

Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski · Mar 24, 2025

Citations: 0

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs.
  • First, we develop benchmarks derived from key problems in economics -- procurement, scheduling, and pricing -- that test an LLM's ability to learn from the environment in context.
Citations: 0

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics Medicine
  • The framework is evaluated on four medical imaging benchmark datasets: chest X-rays of COVID-19, Tuberculosis, Pneumonia, and retinal Optical Coherence Tomography (OCT) images.
Topic-Based Watermarks for Large Language Models

Alexander Nemecek, Yuzhou Jiang, Erman Ayday · Apr 2, 2024

Citations: 0

Score: 78% Moderate protocol signal Freshness: Cold Status: Ready
Automatic Metrics General
  • The indistinguishability of large language model (LLM) output from human-authored content poses significant challenges, raising concerns about potential misuse of AI-generated text and its influence on future model training.
  • Experimental results across multiple LLMs and state-of-the-art benchmarks demonstrate that our method achieves text quality comparable to industry-leading systems and simultaneously improves watermark robustness against paraphrasing and…
LayerT2V: A Unified Multi-Layer Video Generation Framework

Guangzhao Li, Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai · Aug 6, 2025

Citations: 0

Score: 83% Moderate protocol signal Freshness: Cold Status: Fallback
Automatic Metrics Long Horizon General
