Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 128 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,610) General (530) Long Horizon (319) Pairwise Preference (287) Coding (216) Simulation Env (186) Multi Agent (182) Medicine (115) Llm As Judge (106) Expert Verification (97) Human Eval (89) Rubric Rating (82) Web Browsing (79) Math (77) Demonstrations (67) Critique Edit (63)

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 55% High protocol signal Freshness: Hot Status: Ready

Red Team Automatic Metrics Long Horizon General

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety.

Open paper

State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation

Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi, Amarendra Chaudhary · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready

Demonstrations Automatic Metrics General

Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by…
Our results indicate that the combination of sparse MoE architecture, culturally-informed CoT distillation with explicit Arabic linguistic checks, and strategic bilingual data curation enables an open-source adapted model to systematically…

Open paper

Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling

Qingyang Xu, Yaling Shen, Stephanie Fong, Zimu Wang, Yiwen Jiang, Xiangyu Zhao · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready

Red Team Simulation Env Medicine

The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions.
To address this gap, we introduce Personality-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates clients in psychological counseling through coherent, persona-driven client dialogues to expose…

Open paper

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza · Mar 30, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready

Demonstrations Simulation Env Long Horizon General

Open paper

Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Hieu Xuan Le, Benjamin Goh, Quy Anh Tang · Mar 26, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready

Red Team Llm As Judge General

In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while…
In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints.

Open paper

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

Liang Ding · Mar 22, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 55% High protocol signal Freshness: Hot Status: Ready

Demonstrations Human EvalLlm As Judge Long Horizon General

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation.

Open paper

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan · Mar 19, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 55% High protocol signal Freshness: Hot Status: Ready

Demonstrations Simulation Env Multi Agent General

To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.
We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines.

Open paper

Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty · Mar 15, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 55% Moderate protocol signal Freshness: Hot Status: Ready

Red Team Automatic Metrics General

While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed…
Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost.

Open paper

SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions

Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou · Mar 14, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 55% High protocol signal Freshness: Hot Status: Ready

Red Team Automatic Metrics General

The benchmark is constructed from U.S.
CLARITY establishes political response evasion as a challenging benchmark for computational discourse analysis and highlights the difficulty of modeling strategic ambiguity in political language.

Open paper

A Framework for Closed-Loop Robotic Assembly, Alignment and Self-Recovery of Precision Optical Systems

Seou Choi, Sachin Vaidya, Caio Silva, Shiekh Zia Uddin, Sajib Biswas Shuvo, Shrish Choudhary · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 52% Moderate protocol signal Freshness: Hot Status: Fallback

Demonstrations General

Open paper

In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads

Charlotte Pouw, Hosein Mohebbi, Afra Alishahi, Willem Zuidema · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Demonstrations General

Open paper

Exclusive Unlearning

Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Red Team Math

We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to…

Open paper

Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

Michael Cuccarese · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Demonstrations Coding

This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization.
The complete target identification system is described - including LLM-guided evolutionary optimization of scoring functions and blinded agentic reasoning for target rationalization - with demonstration that both stages operate without…

Open paper

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song, Mao Zheng · Apr 1, 2026

Citations: 0

Match reason: Matches selected tags (Demonstrations).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Expert VerificationDemonstrations Law

We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.

Open paper

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, Jerry Wei · Mar 30, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Red Team General

Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning.
Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+…

Open paper

AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective

Zhenyi Wang, Siyu Luan · Mar 25, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Red Team General

Open paper

SecureBreak -- A dataset towards safe and secure models

Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Red Team General

To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security…
The dataset is highly reliable due to careful manual annotation, where labels are assigned conservatively to ensure safety.

Open paper

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

Haozheng Luo, Yimin Wang, Jiahao Yu, Binghui Wang, Yan Chen · Mar 18, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Red Team General

Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey.
Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.

Open paper

SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment

Zhouwei Zhai, Mengxiang Chen, Anmeng Zhang · Mar 17, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Red Team General

Our approach first synthesizes high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data.
Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness.

Open paper

Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection

Yewon Han, Yumin Seol, EunGyung Kong, Minsoo Jo, Taesup Kim · Mar 16, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 48% Sparse protocol signal Freshness: Hot Status: Fallback

Red Team General

Existing jailbreak defence frameworks for Large Vision-Language Models often suffer from a safety utility tradeoff, where strengthening safety inadvertently degrades performance on general visual-grounded reasoning tasks.
In this work, we investigate whether safety and utility are inherently antagonistic objectives.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent