Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 21 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,735) General (557) Long Horizon (344) Pairwise Preference (298) Coding (234) Simulation Env (201) Multi Agent (199) Medicine (119) Llm As Judge (113) Expert Verification (102) Human Eval (92) Rubric Rating (85) Web Browsing (84) Math (82) Demonstrations (73) Red Team (67)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
Apr 24, 2026 · Citations: 0

Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total…
How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Apr 24, 2026 · Citations: 0

In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks.
Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Apr 24, 2026 · Citations: 0

Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities.
Relaxation-Informed Training of Neural Network Surrogate Models
Apr 24, 2026 · Citations: 0

Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude…
An Undecidability Proof for the Plan Existence Problem
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Aligning Dense Retrievers with LLM Utility via DistillationAligning Dense Retrievers with LLM Utility via Distillation
Apr 24, 2026 · Citations: 0

On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base.
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Time-Localized Parametric Decomposition of Respiratory Airflow for Sub-Breath Analysis
Apr 24, 2026 · Citations: 0

Evaluation across 8,276 breaths demonstrates high reconstruction accuracy (mean squared error < 0.001 for four-component models) and robust parameter precision under moderate noise.
CRAFT: Clustered Regression for Adaptive Filtering of Training data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise Preference Llm As JudgeAutomatic Metrics MedicineMultilingual

A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%).

Open paper

Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning

He Huang · Mar 25, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise Preference Automatic Metrics Multilingual

Open paper

Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss · Mar 6, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise Preference Long Horizon Multilingual

We present RAPTOR, Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition a controlled study of compact SSL backbones from the HuBERT and WavLM within a unified pairwise-gated fusion detector, evaluated across 14…

Open paper

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages

Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise Preference Automatic Metrics CodingMultilingual

Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality ''transfer'' across languages.
We synthesize recent findings indicating that (i) safety guardrails weaken sharply on low-resource and code-mixed inputs, (ii) culturally harmful behavior can persist even when standard toxicity scores look acceptable, and (iii)…

Open paper

Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe

Somnath Banerjee · Feb 14, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise Preference Long Horizon Multilingual

The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.

Open paper

CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models

Yifan Le, Yunliang Li · Jan 8, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise Preference Automatic Metrics Multilingual

Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance.
Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than…

Open paper

Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not

Sercan Karakaş · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback

Pairwise Preference Multilingual

Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution.
In a speeded forced-choice comprehension experiment, humans show a large, correctly directed plausibility effect.

Open paper

Rethinking Metrics for Lexical Semantic Change Detection

Roksana Goworek, Haim Dubossarsky · Feb 17, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise Preference Automatic Metrics CodingMultilingual

Open paper

MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin · Jul 2, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 53% Moderate protocol signal Freshness: Cold Status: Ready

Pairwise Preference Automatic Metrics Multilingual

We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages.
Compared to strong baselines, including QuRater, AskLLM, DCLM and so on, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks.

Open paper

Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation

Ying Li, Xinglin Lyu, Junhui Li, Jinlong Yang, Hengchao Shang, Min Zhang · Mar 26, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise Preference Multilingual

In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT.
CPL achieves this by integrating both intra- and cross-condition preferences into the preference optimization objective.

Open paper

Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset

Ryoma Suzuki, Zhiyang Qi, Michimasa Inaba · Mar 24, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise Preference Multilingual

The quality of ``Multilingual KokoroChat'' was rigorously validated through human preference studies.
These evaluations confirmed that the translations produced by our ensemble method were preferred from any individual state-of-the-art LLM.

Open paper

Gender Bias in MT for a Genderless Language: New Benchmarks for Basque

Amaia Murillo, Olatz-Perez-de-Viñaspre, Naiara Perez · Mar 9, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise Preference Multilingual

WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French.
FLORES+Gender, in turn, extends the FLORES+ benchmark to assess whether translation quality varies when translating from gendered languages (Spanish and English) into Basque depending on the gender of the referent.

Open paper

EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training

Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu, Mark Fišel · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise Preference MathCoding

We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior.
Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned…

Open paper

ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li, Shujian Huang · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise Preference Multilingual

Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged…

Open paper

Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment

Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise Preference Multilingual

The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
In this work, we propose a resource-efficient method for improving multilingual safety alignment.

Open paper

A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding

Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit, Thomas Pickard · Jan 13, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise Preference Multilingual

The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.

Open paper

Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion

Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee · Jan 6, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback

Pairwise Preference Multilingual

To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds.
Building on this insight, we introduce DELTA (DEbiased Language preference-guided Text Augmentation), a lightweight and efficient mRAG framework that strategically leverages monolingual alignment to optimize cross-lingual retrieval and…

Open paper

Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics

Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag · Oct 24, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 50% Moderate protocol signal Freshness: Cold Status: Ready

Pairwise Preference Llm As Judge Multilingual

Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking.
Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation…

Open paper

MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam · Sep 30, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 53% High protocol signal Freshness: Cold Status: Fallback

Pairwise PreferenceRubric Rating Automatic Metrics Multilingual

To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms.
Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain.

Open paper

Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov · Dec 9, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual, Pairwise Preference).

Score: 46% Sparse protocol signal Freshness: Cold Status: Fallback

Pairwise Preference Multilingual

Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese.
To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent