Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 72 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

Urchade Zaratiana, Mary Newhauser, George Hurn-Maloney, Ash Lewis · May 8, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Red Team Automatic Metrics Coding
  • Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions.
  • Across nine established safety benchmarks, GLiGuard achieves F1 scores competitive with 7B--27B decoder-based guards despite being 23--90\times smaller, while delivering up to 16\times higher throughput and 17\times lower latency.
Open paper
Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 65% High protocol signal Freshness: Hot Status: Ready
Red Team Automatic Metrics General
  • In the PsyDefDetect shared task at BioNLP 2026 the eight positive defence categories share surface language and differ only in pragmatic function and trained raters reach only moderate inter-annotator agreement.
Open paper
Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs

Elizabeth Mieczkowski, Alexander Ku, Tiwalayo Eisape, Dilip Arumugam, John Matters, Katherine M. Collins · May 7, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Red Team Automatic Metrics General
  • In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations).
  • We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems, where processors must operate under partial observability and communication constraints.
Open paper
SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

Sihang, Zhao, Kangrui Yu, Youliang Yuan, Pinjia He, Hongyi Wen · Apr 24, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Red Team Automatic Metrics Coding
  • To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial…
  • Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol.
Open paper
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

Zheng Fang, Xiaosen Wang, Shenyi Zhang, Shaokang Wang, Zhijin Ge · May 6, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 62% Moderate protocol signal Freshness: Hot Status: Fallback
Red Team Math
  • These results demonstrate that dense waveform updates are largely redundant, and we advocate that future audio jailbreak and safety alignment research should further leverage this heterogeneous token-level gradient structure.
Open paper
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Red Team Automatic Metrics Long Horizon General
  • As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
  • To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety.
Open paper
Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling

Qingyang Xu, Yaling Shen, Stephanie Fong, Zimu Wang, Yiwen Jiang, Xiangyu Zhao · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Red Team Simulation Env Medicine
  • The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions.
  • To address this gap, we introduce Personality-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates clients in psychological counseling through coherent, persona-driven client dialogues to expose…
Open paper
Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Hieu Xuan Le, Benjamin Goh, Quy Anh Tang · Mar 26, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Red Team Llm As Judge General
  • In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while…
  • In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints.
Open paper
Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Red Team Automatic Metrics General
  • While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed…
  • Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost.
Open paper
SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions

Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou · Mar 14, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Red Team Automatic Metrics General
  • The benchmark is constructed from U.S.
  • CLARITY establishes political response evasion as a challenging benchmark for computational discourse analysis and highlights the difficulty of modeling strategic ambiguity in political language.
Open paper
Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies

Siddharth Srikanth, Freddie Liang, Ya-Chuan Hsu, Varun Bhatt, Shihan Zhao, Henry Chen · Mar 12, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Red Team Simulation Env General
  • Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates.
  • Furthermore, results from a user study highlight that Q-DIG generates prompts judged to be more natural and human-like than those from baselines.
Open paper
WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference

Zixun Xiong, Gaoyi Wu, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang · Mar 11, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Red Team Automatic Metrics Multi Agent General
  • Communication topology is a critical factor in the utility and safety of LLM-based multi-agent systems (LLM-MAS), making it a high-value intellectual property (IP) whose confidentiality remains insufficiently studied.
  • To bridge this realism gap, we propose WebWeaver, an attack framework that infers the complete LLM-MAS topology by compromising only a single arbitrary agent instead of the administrative agent.
Open paper
IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal · Mar 11, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Red Team Automatic Metrics General
  • IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections.
  • Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe…
Open paper
Exclusive Unlearning

Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Red Team Math
  • We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to…
Open paper
Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, Jerry Wei · Mar 30, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Red Team General
  • Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning.
  • Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+…
Open paper
SecureBreak -- A dataset towards safe and secure models

Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera · Mar 23, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Red Team General
  • To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security…
  • The dataset is highly reliable due to careful manual annotation, where labels are assigned conservatively to ensure safety.
Open paper
Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

Haozheng Luo, Yimin Wang, Jiahao Yu, Binghui Wang, Yan Chen · Mar 18, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Red Team General
  • Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey.
  • Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.
Open paper

Match reason: Matches selected tags (Red Team).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Red Team General
  • Our approach first synthesizes high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data.
  • Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness.
Open paper
Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection

Yewon Han, Yumin Seol, EunGyung Kong, Minsoo Jo, Taesup Kim · Mar 16, 2026

Citations: 0

Match reason: Matches selected tags (Red Team).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Red Team General
  • Existing jailbreak defence frameworks for Large Vision-Language Models often suffer from a safety utility tradeoff, where strengthening safety inadvertently degrades performance on general visual-grounded reasoning tasks.
  • In this work, we investigate whether safety and utility are inherently antagonistic objectives.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.