Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 10 · Search mode: keyword

Voxtral TTS

Mistral-AI: Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo · Mar 26, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 100% Moderate protocol signal Freshness: Hot Status: Fallback
Human Eval Automatic Metrics Multilingual
  • In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5.
Open paper
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan, Zhitao Zeng · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 81% Moderate protocol signal Freshness: Warm Status: Ready
Red Team Automatic Metrics Multilingual
  • Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs.
  • To enhance readability and evaluation accuracy, we further design a classical Chinese to English translation module.
Open paper
Gender Bias in MT for a Genderless Language: New Benchmarks for Basque

Amaia Murillo, Olatz Perez-de-Viñaspre, Naiara Perez · Mar 9, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 74% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Multilingual
  • WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French.
  • FLORES+Gender, in turn, extends the FLORES+ benchmark to assess whether translation quality varies when translating from gendered languages (Spanish and English) into Basque depending on the gender of the referent.
Open paper
EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training

Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu, Mark Fišel · Mar 2, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 74% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Math Coding
  • We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior.
  • Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned…
Open paper

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 76% Moderate protocol signal Freshness: Cold Status: Fallback
LLM As Judge Automatic Metrics Multilingual
  • Our benchmark is built on a user manual for an agricultural machine, available in English, French, and German.
  • The evaluation focuses on realistic "needle-in-a-haystack" challenges and includes unanswerable questions to test for hallucinations.
Open paper
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri, Trizal Garg · Dec 26, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 58% High protocol signal Freshness: Warm Status: Ready
Expert Verification Automatic Metrics Coding Multilingual
  • To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
  • We evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models, using a strict evaluation protocol.
Open paper
Video-Based Reward Modeling for Computer-Use Agents

Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul, Yang Liu · Mar 10, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Multilingual
  • Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction.
  • In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions.
Open paper
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li, Shujian Huang · Feb 25, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Multilingual
  • Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged…
Open paper
Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque

Oscar Sainz, Naiara Perez, Julen Etxaniz, Joseba Fernandez de Landa, Itziar Aldabe, Iker García-Ferrero · Jun 9, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 46% Sparse protocol signal Freshness: Cold Status: Fallback
Pairwise Preference Coding Multilingual
  • We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants.
  • We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.
Open paper
