Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 21 · Search mode: keyword

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.


Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.


Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Llm As Judge Automatic Metrics Medicine Multilingual
  • A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
  • Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%).
Open paper
Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss · Mar 6, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon Multilingual
  • We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14…
Open paper
Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Coding Multilingual
  • Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages.
  • We synthesize recent findings indicating that (i) safety guardrails weaken sharply on low-resource and code-mixed inputs, (ii) culturally harmful behavior can persist even when standard toxicity scores look acceptable, and (iii)…
Open paper

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Long Horizon Multilingual
  • The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.
Open paper

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Multilingual
  • Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance.
  • Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than…
Open paper

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Pairwise Preference Multilingual
  • Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution.
  • In a speeded forced-choice comprehension experiment, humans show a large, correctly directed plausibility effect.
Open paper
MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin · Jul 2, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 53% Moderate protocol signal Freshness: Cold Status: Ready
Pairwise Preference Automatic Metrics Multilingual
  • We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages.
  • Compared to strong baselines, including QuRater, AskLLM, DCLM and so on, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks.
Open paper
Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation

Ying Li, Xinglin Lyu, Junhui Li, Jinlong Yang, Hengchao Shang, Min Zhang · Mar 26, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Multilingual
  • In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT.
  • CPL achieves this by integrating both intra- and cross-condition preferences into the preference optimization objective.
Open paper

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Multilingual
  • The quality of "Multilingual KokoroChat" was rigorously validated through human preference studies.
  • These evaluations confirmed that the translations produced by our ensemble method were preferred over those from any individual state-of-the-art LLM.
Open paper
Gender Bias in MT for a Genderless Language: New Benchmarks for Basque

Amaia Murillo, Olatz Perez-de-Viñaspre, Naiara Perez · Mar 9, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Multilingual
  • WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French.
  • FLORES+Gender, in turn, extends the FLORES+ benchmark to assess whether translation quality varies when translating from gendered languages (Spanish and English) into Basque depending on the gender of the referent.
Open paper
EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training

Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu, Mark Fišel · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Math Coding
  • We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior.
  • Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned…
Open paper
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li, Shujian Huang · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Multilingual
  • Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged…
Open paper
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment

Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Multilingual
  • The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
  • In this work, we propose a resource-efficient method for improving multilingual safety alignment.
Open paper
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding

Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit, Thomas Pickard · Jan 13, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Multilingual
  • The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
  • The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.
Open paper
Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion

Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee · Jan 6, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Multilingual
  • To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds.
  • Building on this insight, we introduce DELTA (DEbiased Language preference-guided Text Augmentation), a lightweight and efficient mRAG framework that strategically leverages monolingual alignment to optimize cross-lingual retrieval and…
Open paper
Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics

Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag · Oct 24, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 50% Moderate protocol signal Freshness: Cold Status: Ready
Pairwise Preference Llm As Judge Multilingual
  • Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking.
  • Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation…
Open paper
MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam · Sep 30, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 53% High protocol signal Freshness: Cold Status: Fallback
Pairwise Preference Rubric Rating Automatic Metrics Multilingual
  • To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms.
  • Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain.
Open paper
Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov · Dec 9, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 46% Sparse protocol signal Freshness: Cold Status: Fallback
Pairwise Preference Multilingual
  • Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese.
  • To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language.
Open paper
