Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 21 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Demonstrations Human EvalLlm As Judge Long Horizon General
  • LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
  • We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation.
Open paper
Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing

Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul · Mar 6, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% High protocol signal Freshness: Warm Status: Ready
Human EvalAutomatic Metrics Long Horizon General
  • We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis…
  • Supported by over 200 hours of expert human evaluation, BRTR achieves state-of-the-art performance across three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR-Bench, 7 points on…
Open paper
TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation

Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalniņš, Dāvis Nicmanis, Jeļizaveta Jeļinska, Roberts Rozis · Mar 9, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 97% Sparse protocol signal Freshness: Warm Status: Ready
Human Eval Multilingual
  • Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages.
  • Human evaluations confirm an up to tenfold reduction in linguistic errors relative to leading baselines.
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 97% Sparse protocol signal Freshness: Warm Status: Ready
Human Eval Coding
  • Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at…
Open paper
DialectLLM: A Dialect-Aware Dialog[ue] Generation Framework Beyond Standard American English

Jio Oh, Paul Vicinanza, Thomas Butler, Steven Euijong Whang, Dezhi Hong, Amani Namboori · Jan 30, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 98% High protocol signal Freshness: Cold Status: Ready
Pairwise Preference Human EvalAutomatic Metrics General
  • Human evaluation confirms data quality, with annotators preferring DialectLLM over prior methods in 98.8% of pairwise comparisons for dialect naturalness.
  • Beyond benchmarking, we show that DialectLLM data also serve as a scalable LLM post-training resource, suggesting a practical path toward dialect-aware conversational AI.
Open paper
TaoSR1: The Thinking Model for E-commerce Relevance Search

Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang · Aug 17, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 98% Moderate protocol signal Freshness: Cold Status: Ready
Pairwise Preference Human Eval General
  • Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3)…
  • TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.
Open paper
From Passive to Persuasive: Localized Activation Injection for Empathy and Negotiation

Niranjan Chebrolu, Kokil Jaidka, Gerard Christopher Yeo · Nov 16, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 95% Moderate protocol signal Freshness: Cold Status: Ready
Human Eval General
  • Evaluated on emotional dialogue and negotiation in both single- and multi-turn settings, localized injection consistently outperforms global steering and instruction priming; human evaluation confirms that gains reflect genuine improvements…
Open paper
Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang · Oct 27, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 95% Moderate protocol signal Freshness: Cold Status: Ready
Pairwise Preference Human Eval Coding
  • Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation.
  • On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters.
Open paper
Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi, Rico Sennrich · Mar 26, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% Moderate protocol signal Freshness: Warm Status: Fallback
Human EvalAutomatic Metrics Multilingual
  • A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
Open paper
Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen

James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky · Feb 27, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% Moderate protocol signal Freshness: Warm Status: Fallback
Human EvalAutomatic Metrics Multilingual
  • This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose.
  • We assess translation quality using both standard automated evaluation metrics (BLEU, chrF++, METEOR, ROUGE-L, BERTScore, COMET, BLEURT) and expert human evaluation via a modified Multidimensional Quality Metrics (MQM) framework applied to…
Open paper
Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 100% Moderate protocol signal Freshness: Warm Status: Fallback
Human EvalAutomatic Metrics Law
  • Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
  • Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini's superior interpretability.
Open paper
Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu · Oct 7, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 95% Moderate protocol signal Freshness: Cold Status: Fallback
Human EvalAutomatic Metrics General
  • Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).
Open paper

Match reason: Matched by broad semantic/index fallback.

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Human Eval General
  • These systems commonly achieve performance levels comparable to or superior than trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that…
Open paper
Evaluating LLM-Based Grant Proposal Review via Structured Perturbations

William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana Maynard · Mar 9, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 52% Sparse protocol signal Freshness: Warm Status: Ready
Human Eval LawCoding
  • As AI-assisted grant proposals outpace manual review capacity in a kind of ``Malthusian trap'' for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation.
  • Human evaluation shows LLM feedback is largely valid but skewed toward compliance checking over holistic assessment.
Open paper
DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

Md Hasebul Hasan, Krity Haque Charu, Eshwara Prasad Sridhar, Shuchisnigdha Deb, Mohammad A. Islam · Mar 20, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback
Human EvalLlm As Judge LawCoding
  • To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from publicly available video repositories.
  • Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, BERTScore, Realism Score, and human evaluation metrics.
Open paper
Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 53% Moderate protocol signal Freshness: Cold Status: Ready
Pairwise PreferenceCritique Edit Human Eval General
  • In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion…
  • Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations.
Open paper
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee · Jan 9, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 53% Moderate protocol signal Freshness: Cold Status: Ready
Pairwise PreferenceRubric Rating Human EvalLlm As Judge General
  • Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
  • We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations.
Open paper
Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis

Yushen Chen, Junzhe Liu, Yujie Tu, Zhikang Niu, Yuzhe Liang, Chunyu Qiang · Jan 20, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 53% Moderate protocol signal Freshness: Cold Status: Fallback
Human Eval Long Horizon Coding
  • Key barriers include substantial cross-dialect lexical and phonological divergence, scarce synthesis-grade data, and the absence of a standardized multi-dialect evaluation benchmark.
  • We further release the first standardized multi-dialect Arabic TTS benchmark, comprising over 11,000 utterances across 7 dialect subsets with manually verified transcripts.
Open paper
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang · Apr 24, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 53% Moderate protocol signal Freshness: Cold Status: Fallback
Human Eval Multi Agent Coding
  • Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into operational code repositories.
  • Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline.
Open paper
EasyAnimate: High-Performance Video Generation Framework with Hybrid Windows Attention and Reward Backpropagation

Jiaqi Xu, Kunzhe Huang, Xinyi Zou, Yunkuo Chen, Bo Liu, MengLi Cheng · May 29, 2024

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 53% Moderate protocol signal Freshness: Cold Status: Fallback
Pairwise Preference Human Eval Coding
  • To enhance video generation quality, we optimize EasyAnimate using reward backpropagation to better align with human preferences.
  • The EasyAnimate achieves state-of-the-art performance on both the VBench leaderboard and human evaluation.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.