Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 736 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,620) General (530) Long Horizon (320) Pairwise Preference (288) Coding (218) Simulation Env (187) Multi Agent (182) Medicine (116) Llm As Judge (107) Expert Verification (97) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (77) Demonstrations (67) Critique Edit (63)

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Punctuated Equilibria in Artificial Intelligence: The Institutional Scaling Law and the Speciation of Sovereign AI

Mark Baciak, Thomas A. Cellucci, Deanna M. Falkowski · Mar 15, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MathLaw

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

SommBench: Assessing Sommelier Expertise of Language Models

William Brach, Tomas Bedej, Jacob Nielsen, Jacob Pichna, Juraj Bedej, Eemeli Saarensilta · Mar 12, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready

Multilingual

Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form.
Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste.

Open paper

Criterion-referenceability determines LLM-as-a-judge validity across physics assessment formats

Will Yeadon, Tom Hardy, Paul Mackay, Elise Agra · Mar 16, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback

Llm As JudgeAutomatic Metrics Coding

We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations…
Across n=55 scripts (n=275 essays), blind AI marking is harsher and more variable than human marking, with discriminative validity already poor (ρ\approx 0.1).

Open paper

CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad

Yongqiang Chen, Chenxi Liu, Zhenhao Chen, Tongliang Liu, Bo Han, Kun Zhang · Mar 15, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready

General

Evolve-based agent such as AlphaEvolve is one of the notable successes in using Large Language Models (LLMs) to build AI Scientists.
These agents tackle open-ended scientific problems by iteratively improving and evolving programs, leveraging the prior knowledge and reasoning capabilities of LLMs.

Open paper

MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection

Arkadiusz Modzelewski, Witold Sosnowski, Eleni Papadopulos, Elisa Sartori, Tiziano Labruna, Giovanni Da San Martino · Mar 15, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready

General

This work presents MALINT, the first human-annotated English corpus developed in collaboration with expert fact-checkers to capture disinformation and its malicious intent.
We utilize our novel corpus to benchmark 12 language models, including small language models (SLMs) such as BERT and large language models (LLMs) like Llama 3.3, on binary and multilabel intent classification tasks.

Open paper

Design and evaluation of an agentic workflow for crisis-related synthetic tweet datasets

Roben Delos Reyes, Timothy Douglas, Asanobu Kitamoto · Mar 13, 2026

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready

General

These limitations constrain the development and evaluation of AI systems used in crisis informatics.
To address these limitations, we introduce an agentic workflow for generating crisis-related synthetic tweet datasets.

Open paper

Long-Context Encoder Models for Polish Language Understanding

Sławomir Dadas, Rafał Poświata, Marek Kozłowski, Małgorzata Grębowiec, Michał Perełkiewicz, Paweł Klimiuk · Mar 12, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Multilingual

The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding.

Open paper

NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments

Rupak Raj Ghimire, Bipesh Subedi, Balaram Prasain, Prakash Poudyal, Praveen Acharya, Nischal Karki · Mar 14, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Multilingual

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

Haitao Jiang, Wenbo Zhang, Jiarui Yao, Hengrui Cai, Sheng Wang, Rui Song · Mar 14, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL

Amos Goldman, Nimrod Boker, Maayan Sheraizin, Nimrod Admoni, Artem Polyakov, Subhadeep Bhattacharya · Mar 13, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Beyond Creed: A Non-Identity Safety Condition A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning

Xinran Zhang · Mar 16, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Ready

General

How safety supervision is written may matter more than the explicit identity content it contains.
Across three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate HarmBench using a reconciled dual-judge pipeline combining Bedrock-hosted DeepSeek v3.2 and Sonnet 4.6, with disagreement and boundary…

Open paper

GameUIAgent: An LLM-Powered Framework for Automated Game UI Design with Structured Intermediate Representation

Wei Zeng, Fengwei An, Zhen Liu, Jian Zhao · Mar 16, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

Law

We present GameUIAgent, an LLM-powered agentic framework that translates natural language descriptions into editable Figma designs via a Design Spec JSON intermediate representation.
A Rendering-Evaluation Fidelity Principle reveals that partial rendering enhancements paradoxically degrade VLM evaluation by amplifying structural defects.

Open paper

Towards Next-Generation LLM Training: From the Data-Centric Perspective

Hao Liang, Zhengyang Zhao, Zhaoyang Han, Meiyi Qiang, Xiaochen Ma, Bohan Zeng · Mar 16, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

General

In current practice, LLM training data is often constructed using ad hoc scripts, and there is still a lack of mature, agent-based data preparation systems that can automatically construct robust and reusable data workflows, thereby freeing…
First, we propose building a robust, agent-based automatic data preparation system that supports automated workflow construction and scalable data management.

Open paper

Seamless Deception: Larger Language Models Are Better Knowledge Concealers

Dhananjay Ashok, Ruth-Ann Armstrong, Jonathan May · Mar 15, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

General

Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, with gradient-based concealment proving easier to identify than prompt-based methods.

Open paper

Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models

Deepon Halder, Angira Mukherjee · Mar 15, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

Multilingual

Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts.

Open paper

Questionnaire Responses Do not Capture the Safety of AI Agents

Max Hellrigel-Holderbaum, Edward James Young · Mar 15, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

General

As AI systems advance in capabilities, measuring their safety and alignment to human values is becoming paramount.
By focusing on unaugmented LLMs, they fall short of evaluating AI agents, which could actually perform relevant behaviors, hence posing much greater risks.

Open paper

Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors

Mark Rofin, Jalal Naghiyev, Michael Hahn · Mar 14, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

GhanaNLP Parallel Corpora: Comprehensive Multilingual Resources for Low-Resource Ghanaian Languages

Lawrence Adu Gyamfi, Paul Azunre, Stephen Edward Moore, Joel Budu, Akwasi Asare, Mich-Seth Owusu · Mar 14, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

Multilingual

The data were collected, translated, and annotated by human professionals and enriched with standard structural metadata to ensure consistency and usability.
This paper documents the dataset creation methodology, structure, intended use cases, and evaluation, as well as their deployment in real world applications such as the Khaya AI translation engine.

Open paper

LiveWeb-IE: A Benchmark For Online Web Information Extraction

Seungbin Yang, Jihwan Kim, Jaemin Choi, Dongjin Kim, Soyoung Yang, ChaeHun Park · Mar 14, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

General

To bridge this gap, we introduce \dataset, a new benchmark designed for evaluating WIE systems directly against live websites.
In addition, we propose Visual Grounding Scraper (VGS), a novel multi-stage agentic framework that mimics human cognitive processes by visually narrowing down web page content to extract desired information.

Open paper

The AI Fiction Paradox

Katherine Elkins · Mar 13, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

Law

Second, I identify an informational revaluation challenge: fiction systematically violates the computational assumption that informational importance aligns with statistical salience, requiring readers and models alike to retrospectively…
Fiction concentrates uniquely powerful cognitive and emotional patterns for modeling human behavior, and mastery of these patterns by AI systems would represent not just a creative achievement but a potent vehicle for human manipulation at…

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent