A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total…
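A minimal synthetic sketch of the comparison this line describes: fitting a saturating power law on a ~10% subsample of runs versus the full set. The functional form, constants, and random subsample here stand in for the paper's design-based selection and are assumptions, not its method.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    return a * n ** -b + c  # saturating power law (assumed form)

rng = np.random.default_rng(0)
n_params = np.logspace(6, 10, 200)                 # 200 synthetic runs
loss = scaling_law(n_params, 4e2, 0.3, 1.7)
loss += rng.normal(0.0, 0.02, size=loss.shape)     # observation noise

subset = rng.choice(len(n_params), size=20, replace=False)  # ~10% of runs
p0 = (1e2, 0.5, 1.0)
full_fit, _ = curve_fit(scaling_law, n_params, loss, p0=p0)
sub_fit, _ = curve_fit(scaling_law, n_params[subset], loss[subset], p0=p0)
print("full-data fit:", full_fit, "\n10% subset fit:", sub_fit)
```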
Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term "world model" carries different meanings across research communities.
Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude…
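One plausible reading of such regularizers, sketched under assumptions: penalties that sparsify the surrogate's weights so the network's MILP encoding carries fewer variables and big-M constraints. The L1 penalty below illustrates the general idea only; it is not the paper's exact regularizers.

```python
import torch
import torch.nn as nn

surrogate = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
lam = 1e-4  # penalty strength (assumed)

def training_step(x: torch.Tensor, y: torch.Tensor) -> float:
    pred = surrogate(x)
    fit = nn.functional.mse_loss(pred, y)
    # L1 penalty pushes many weights toward zero; pruning near-zero weights
    # afterwards shrinks the MILP the downstream solver has to handle.
    sparsity = sum(p.abs().sum() for p in surrogate.parameters())
    loss = fit + lam * sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```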
Evaluation across 8,276 breaths demonstrates high reconstruction accuracy (mean squared error < 0.001 for four-component models) and robust parameter precision under moderate noise.
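A toy version of the reconstruction-plus-MSE pipeline, assuming a four-component Fourier basis; the paper's actual component model may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 500)                      # one normalized breath
waveform = np.sin(2 * np.pi * t) + 0.3 * np.sin(4 * np.pi * t)
waveform += 0.02 * rng.normal(size=t.size)          # moderate noise

# Four basis components (sine/cosine of the first two harmonics), fit by
# linear least squares; swap in any four-component model of interest.
X = np.column_stack(
    [f(2 * np.pi * k * t) for k in (1, 2) for f in (np.sin, np.cos)]
)
coef, *_ = np.linalg.lstsq(X, waveform, rcond=None)
mse = float(np.mean((X @ coef - waveform) ** 2))
print(f"reconstruction MSE: {mse:.5f}")             # ~4e-4 at this noise level
```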
In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge.
LaTeXTrans ensures format preservation, structural fidelity, and terminology consistency through six specialized agents: 1) a Parser that decomposes LaTeX into translation-friendly units via placeholder substitution and syntax filtering; 2)…
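A minimal sketch of the placeholder-substitution idea behind such a Parser: protect math and commands with opaque tokens, translate the remaining text, then restore. The regexes and token format are assumptions, not LaTeXTrans internals.

```python
import re

# Matches inline math ($...$) and simple commands like \textbf{...}.
PROTECT = re.compile(r"\$[^$]*\$|\\[a-zA-Z]+(?:\{[^{}]*\})?")

def mask(source: str):
    slots = []
    def repl(m):
        slots.append(m.group(0))
        return f"<PH{len(slots) - 1}>"
    return PROTECT.sub(repl, source), slots

def unmask(text: str, slots):
    for i, s in enumerate(slots):
        text = text.replace(f"<PH{i}>", s)
    return text

masked, slots = mask(r"We bound $\epsilon$ via \textbf{Lemma 2}.")
# ... send `masked` through any MT system, then:
restored = unmask(masked, slots)
```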
We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages.
Compared to strong baselines such as QuRater, AskLLM, and DCLM, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks.
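For context, rater-driven filtering typically reduces to scoring documents and keeping a top slice. The sketch below uses a placeholder heuristic where MuRating's learned multilingual rater would sit; nothing here is the paper's model or API.

```python
def rate_quality(doc: str) -> float:
    # Placeholder heuristic (type-token ratio); a learned rater goes here.
    words = doc.split()
    return min(len(set(words)) / max(len(words), 1), 1.0)

def filter_corpus(docs: list[str], keep_fraction: float = 0.5) -> list[str]:
    ranked = sorted(docs, key=rate_quality, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

corpus = ["the the the the", "A concise, information-dense sentence."]
print(filter_corpus(corpus, keep_fraction=0.5))
```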
Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking.
Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation…
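A simple probe for the first bias: correlate translation length with the metric's predicted error score across segments. The scores can come from any learned or LLM-as-a-Judge QE metric; the diagnostic itself is the point.

```python
import numpy as np

def length_bias(translations: list[str], qe_scores: list[float]) -> float:
    """Pearson correlation between translation length and predicted errors."""
    lengths = np.array([len(t.split()) for t in translations], dtype=float)
    scores = np.array(qe_scores, dtype=float)
    # A positive correlation suggests the metric predicts more errors simply
    # because translations are longer, independent of actual quality.
    return float(np.corrcoef(lengths, scores)[0, 1])
```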
To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality using mechanisms inspired by audience design.
Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain.
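A hedged sketch of the judge-as-generative-reward-model pattern: prompt a judge for a scalar rating and normalize it into an RL reward. The rubric, parsing, and `judge` callable are illustrative stand-ins, not MENLO's trained judge.

```python
import re

RUBRIC = (
    "Rate the response's native-like quality for a {lang} audience "
    "from 1 (poor) to 5 (native-like). Answer with a single digit.\n"
    "Response: {response}"
)

def reward(judge, response: str, lang: str) -> float:
    raw = judge(RUBRIC.format(lang=lang, response=response))
    match = re.search(r"[1-5]", raw)
    score = int(match.group()) if match else 1   # fail closed on junk output
    return (score - 1) / 4.0                     # normalize to [0, 1] for RL
```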
Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 12.3% in average accuracy, while being cost-efficient and broadly generalizable…
We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base.
Our results reveal a substantial performance gap: Gemini-3-Pro achieves 58.1% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%.
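Execution accuracy, as used here, typically means the predicted query's executed result set matches the gold query's. A minimal sketch with sqlite3 as a stand-in backend; the benchmark's actual database may differ.

```python
import sqlite3

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred = set(map(tuple, conn.execute(pred_sql).fetchall()))
        gold = set(map(tuple, conn.execute(gold_sql).fetchall()))
    except sqlite3.Error:
        return False          # non-executable prediction counts as wrong
    finally:
        conn.close()
    return pred == gold       # order-insensitive result-set comparison
```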
These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and…
However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences…
Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations.
LLM benchmarks for Estonian remain scarce, and no comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet been conducted.
We introduce a new benchmark for evaluating LLMs in Estonian, based on seven diverse datasets.
We present a comprehensive set of experiments for Basque that systematically study different combinations of these components, evaluating them on benchmarks and on human preferences from 1,680 participants.
We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.
Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages.
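A minimal sketch of the kind of per-language refusal tally such a study relies on; the `chat` callable and the keyword heuristic are stand-ins for the real model and refusal judge.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "unable to help")

def refusal_rate(chat, prompts: list[str]) -> float:
    refused = sum(
        any(m in chat(p).lower() for m in REFUSAL_MARKERS) for p in prompts
    )
    return refused / len(prompts)

# Usage: compare refusal_rate(chat, malicious[lang]) against
# refusal_rate(chat, benign[lang]) for each of the 14 languages.
```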
To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems.
We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.