Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 57

CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri, Trizal Garg · Dec 26, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 53% · High protocol signal · Freshness: Cold · Status: Ready
Expert Verification · Automatic Metrics · Coding · Multilingual
  • To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
  • We evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models, using a strict evaluation protocol.
LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination

Ziming Zhu, Chenglong Wang, Haosong Xv, Shunjie Xing, Yifu Huo, Fengning Tian · Aug 26, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 53% · High protocol signal · Freshness: Cold · Status: Ready
Demonstrations · Automatic Metrics · Multi Agent · Math · Coding
  • In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge.
  • LaTeXTrans ensures format preservation, structural fidelity, and terminology consistency through six specialized agents: 1) a Parser that decomposes LaTeX into translation-friendly units via placeholder substitution and syntax filtering; 2)…
MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin · Jul 2, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · Automatic Metrics · Multilingual
  • We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages.
  • Compared to strong baselines, including QuRater, AskLLM, DCLM and so on, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks.
Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics

Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag · Oct 24, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 50% · Moderate protocol signal · Freshness: Cold · Status: Ready
Pairwise Preference · LLM As Judge · Multilingual
  • Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking.
  • Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation…
MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam · Sep 30, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 53% · High protocol signal · Freshness: Cold · Status: Fallback
Pairwise Preference · Rubric Rating · Automatic Metrics · Multilingual
  • To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms.
  • Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain.

Match reason: Matches selected tags (Multilingual).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Fallback
LLM As Judge · Automatic Metrics · Multilingual
  • Our benchmark is built on a user manual for an agricultural machine, available in English, French, and German.
  • The evaluation focuses on realistic "needle-in-a-haystack" challenges and includes unanswerable questions to test for hallucinations.
RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning

Xiao Liu, Da Yin, Zirui Wu, Yansong Feng · May 27, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Automatic Metrics · Tool Use · Multilingual
  • Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 12.3% on average accuracy, while being cost-efficient and broadly generalizable…
BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

Mathew J. Koretsky, Maya Willey, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov · May 23, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 53% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Automatic Metrics · Long Horizon · Medicine · Coding
  • We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base.
  • Our results reveal a substantial performance gap: Gemini-3-Pro achieves 58.1% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%.
World Simulation with Video Foundation Models for Physical AI

NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji · Oct 28, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 50% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Simulation Env · Long Horizon · Coding · Multilingual
  • These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
  • To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and…
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang · Jun 4, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 50% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Expert Verification · Multilingual
  • However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences…
  • Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations.
EuroGEST: Investigating gender stereotypes in multilingual language models

Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch · Jun 4, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 50% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Human Eval · Automatic Metrics · Multilingual
  • Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric.
  • Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages.
Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov · Dec 9, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 46% · Sparse protocol signal · Freshness: Cold · Status: Fallback
Pairwise Preference · Multilingual
  • Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese.
  • To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language.
Estonian Native Large Language Model Benchmark

Helena Grete Lillepalu, Tanel Alumäe · Oct 24, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 46% · Sparse protocol signal · Freshness: Cold · Status: Fallback
Human Eval · LLM As Judge · Multilingual
  • The availability of LLM benchmarks for the Estonian language is limited, and a comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet to be conducted.
  • We introduce a new benchmark for evaluating LLMs in Estonian, based on seven diverse datasets.
Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque

Oscar Sainz, Naiara Perez, Julen Etxaniz, Joseba Fernandez de Landa, Itziar Aldabe, Iker García-Ferrero · Jun 9, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 46% · Sparse protocol signal · Freshness: Cold · Status: Fallback
Pairwise Preference · Coding · Multilingual
  • We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants.
  • We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.
Refusal Direction is Universal Across Safety-Aligned Languages

Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank · May 22, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 46% · Sparse protocol signal · Freshness: Cold · Status: Fallback
Red Team · Multilingual
  • Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
  • In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages.
Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression

Jingyu Peng, Maolin Wang, Nan Wang, Jiatong Li, Yuchen Li, Yuyang Ye · May 18, 2025

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 46% · Sparse protocol signal · Freshness: Cold · Status: Fallback
Red Team · Multilingual
  • To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems.
  • We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.
