Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 4
LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering

Rafid Ishrak Jahan, Fahmid Shahriar Iqbal, Sagnik Ray Choudhury · Feb 27, 2026

Citations: 0
Pairwise Preference · Rubric Rating · General
  • We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA.
  • We propose nine rubrics for answer quality evaluation and show that simple linear models over these rubric features perform comparably to state-of-the-art LLM evaluators (see the sketch below).
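
To make that last claim concrete, here is a minimal sketch of a linear pairwise-preference model over rubric scores. The nine rubric names, the score scale, and the toy training data are placeholder assumptions, not the paper's actual rubrics or dataset.

```python
# Hedged sketch: a linear model predicting pairwise preference from rubric
# score differences. Rubric names below are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

RUBRICS = [  # placeholder names; the paper defines its own nine rubrics
    "factuality", "completeness", "coherence", "relevance", "conciseness",
    "clarity", "citation_use", "structure", "tone",
]

def featurize(ans_a: dict, ans_b: dict) -> np.ndarray:
    """Feature vector = per-rubric score differences between answers A and B."""
    return np.array([ans_a[r] - ans_b[r] for r in RUBRICS])

# Toy data standing in for rubric-score differences; label 1 means A preferred.
rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=(1000, len(RUBRICS)))
y = (X.sum(axis=1) + rng.normal(scale=2.0, size=1000) > 0).astype(int)

clf = LogisticRegression().fit(X, y)  # the "simple linear model"
a = {r: 4 for r in RUBRICS}  # answer A scores slightly higher on every rubric
b = {r: 3 for r in RUBRICS}
print(clf.predict(featurize(a, b).reshape(1, -1)))  # expect preference for A
```
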
MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam · Sep 30, 2025

Citations: 0
Pairwise Preference · Rubric Rating · Automatic Metrics · Multilingual
  • We introduce MENLO, a framework that operationalizes the evaluation of native-like response quality using mechanisms inspired by audience design.
  • Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain (see the sketch below).
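
To illustrate the generative-reward-model idea, here is a hedged sketch of an LLM judge whose parsed rating becomes a scalar reward for native-like quality. The prompt wording, the 1-5 scale, and the call_judge client are illustrative assumptions, not MENLO's actual protocol.

```python
# Hypothetical sketch: an LLM judge as a generative reward model for
# native-likeness. `call_judge` is a stub for any chat-model API.
import re

PROMPT = (
    "You are a native speaker of {lang}. Rate how native-like the following "
    "response reads on a 1-5 scale, then output 'Score: <n>'.\n\n"
    "Response:\n{response}"
)

def call_judge(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-model client here")

def native_likeness_reward(response: str, lang: str) -> float:
    """Query the judge and parse its 1-5 rating into a reward in [0, 1]."""
    out = call_judge(PROMPT.format(lang=lang, response=response))
    match = re.search(r"Score:\s*([1-5])", out)
    return (int(match.group(1)) - 1) / 4.0 if match else 0.0
```
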
GenCAT: Generative Computerized Adaptive Testing

Wanyong Feng, Andrew Lan · Feb 23, 2026

Citations: 0
Pairwise Preference · Coding
  • We train the model in two steps: first supervised fine-tuning, then preference optimization for knowledge-response alignment (see the sketch below).
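
A minimal sketch of what the second step could look like, assuming the preference-optimization objective is DPO-style; the paper may use a different method, and the tensors below are toy stand-ins for per-response log-probabilities.

```python
# Hedged sketch of a DPO-style preference-optimization loss (an assumption;
# not necessarily the paper's objective).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over log-probs of chosen/rejected responses under the
    policy and a frozen reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with dummy log-probs for a batch of 4 preference pairs.
lp = torch.randn(4)
print(dpo_loss(lp + 1.0, lp - 1.0, lp, lp))
```

One appeal of a DPO-style objective here is that it optimizes directly on preference pairs without training a separate reward model.
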
AITutor-EvalKit: Exploring the Capabilities of AI Tutors

Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025

Citations: 0
Demonstrations · General
  • We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors and provides software for demonstration and evaluation, as well as model inspection and data visualization (a sketch of a rubric-style scoring interface follows).
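
For orientation, a hypothetical sketch of what scoring a tutor turn along pedagogical dimensions could look like; the dimension names and the score_dimension stub are assumptions, not AITutor-EvalKit's actual API.

```python
# Hypothetical interface for per-dimension pedagogical scoring of a tutor turn.
DIMENSIONS = ["mistake_identification", "providing_guidance", "actionability"]

def score_dimension(tutor_turn: str, dimension: str) -> float:
    """Placeholder: swap in a fine-tuned classifier or LLM judge per dimension."""
    raise NotImplementedError

def evaluate_tutor_turn(tutor_turn: str) -> dict[str, float]:
    """Return a score for each pedagogical dimension of one tutor turn."""
    return {d: score_dimension(tutor_turn, d) for d in DIMENSIONS}
```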
