

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 195 · Search mode: keyword
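
Every entry in the feed carries the same triage fields: score, protocol signal, freshness, status, tags, and match reason. For readers scripting against the feed, here is a minimal sketch of one entry as a record. The field names are assumptions for illustration; the explorer publishes no official schema.

```python
from dataclasses import dataclass, field

@dataclass
class PaperEntry:
    """One feed entry. Field names are illustrative, not an official schema."""
    title: str
    authors: list[str]
    date: str                 # e.g. "2025-10-27"
    citations: int
    score: float              # 0.0-1.0; rendered as a percentage in the UI
    protocol_signal: str      # "High" | "Moderate" | "Sparse"
    freshness: str            # "Warm" | "Cold"
    status: str               # "Ready" | "Fallback"
    tags: list[str] = field(default_factory=list)
    match_reason: str = ""
```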

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to explore deeper content clusters.

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

Start Here: By Objective

Pick your immediate research objective and jump straight to high-signal pages instead of a generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts for your annotation pipeline.

Task Arithmetic with Support Languages for Low-Resource ASR

Emma Rafkin, Dan DeGenaro, Xiulin Yang · Jan 11, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% · Moderate protocol signal · Freshness: Warm · Status: Ready
Tags: Automatic Metrics · General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
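
The "Keyword overlap 2/2" match reasons shown on these entries can be read as a simple hit count of query terms over the title and protocol fields. Below is a minimal sketch of that interpretation; the explorer's actual matcher is not published, so treat the whole function as an assumption.

```python
def keyword_overlap_score(query_terms: set[str], title: str,
                          protocol_fields: str) -> tuple[int, int]:
    """Count how many query terms appear in the title or protocol fields.

    Illustrative reconstruction of the feed's "keyword overlap n/m"
    match reason, not the explorer's real implementation.
    """
    haystack = f"{title} {protocol_fields}".lower()
    hits = sum(1 for term in query_terms if term.lower() in haystack)
    return hits, len(query_terms)

# Example: a 2/2 overlap like the entry above
hits, total = keyword_overlap_score(
    {"asr", "low-resource"},
    "Task Arithmetic with Support Languages for Low-Resource ASR",
    "automatic metrics; general",
)
print(f"Keyword overlap {hits}/{total}")  # -> Keyword overlap 2/2
```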
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning

Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui · Oct 31, 2025

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 78% · High protocol signal · Freshness: Cold · Status: Ready
Tags: Pairwise Preference · Automatic Metrics · Long Horizon · General
  • We introduce BEAT, the first framework to inject visual backdoors into VLM-based embodied agents, using objects in the environment as triggers.
  • Across various embodied agent benchmarks and VLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements.
Open paper

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 78% · High protocol signal · Freshness: Cold · Status: Ready
Tags: Pairwise Preference · Automatic Metrics · Multi Agent · General
  • While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition.
  • Priming conversations with specific topics identified synergistic teams that outperform random baselines on downstream benchmarks and achieve accuracy comparable to that of manually curated teams based on known model specializations.
Open paper
Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang · Oct 27, 2025

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 75% · Moderate protocol signal · Freshness: Cold · Status: Ready
Tags: Pairwise Preference · Human Eval · Coding
  • Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation.
  • On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters.
Open paper
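
TIR-Judge's abstract describes a judge that calls a code executor during evaluation. Here is a minimal sketch of that idea for pairwise coding comparisons, assuming a toy protocol in which each candidate answer is a Python snippet and the task ships executable checks that print "ok". The executor interface, the check format, and the deterministic verdict rule are all assumptions; the paper's actual executor and RL training loop are not shown here.

```python
import subprocess
import sys
import tempfile
import textwrap

def execute_snippet(code: str, timeout: float = 5.0) -> str:
    """Run a candidate snippet in a subprocess and capture stdout.

    Stand-in for the code executor a tool-integrated judge would call.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent(code))
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout.strip()

def pairwise_judge(checks: str, answer_a: str, answer_b: str) -> str:
    """Prefer the answer whose execution passes the task's checks.

    Deterministic toy: a real TIR-style judge interleaves LLM reasoning
    with executor calls instead of relying on execution alone.
    """
    verdict = {}
    for name, code in (("A", answer_a), ("B", answer_b)):
        try:
            verdict[name] = execute_snippet(code + "\n" + checks) == "ok"
        except Exception:
            verdict[name] = False  # crash or timeout counts as a failure
    if verdict["A"] == verdict["B"]:
        return "tie"
    return "A" if verdict["A"] else "B"

# Example: checks verify a factorial implementation
checks = "print('ok' if factorial(5) == 120 else 'fail')"
print(pairwise_judge(
    checks,
    "def factorial(n):\n    return 1 if n < 2 else n * factorial(n - 1)",
    "def factorial(n):\n    return n * n",
))  # -> "A"
```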
Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Rao Koluguri, Piotr Żelasko · Oct 8, 2025

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 75% · Moderate protocol signal · Freshness: Cold · Status: Ready
Tags: Automatic Metrics · Coding · Multilingual
  • We present the Open ASR Leaderboard, a reproducible benchmarking platform with community contributions from academia and industry.
  • We present our evaluation methodology to facilitate community-driven benchmarking in ASR and other tasks.
Open paper
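
For context on what a leaderboard like this has to pin down: ASR systems are ranked by word error rate (WER), which is word-level edit distance normalized by reference length. A minimal reference implementation follows; note that the text normalization rules applied before this computation (casing, punctuation) matter as much as the metric itself for reproducibility.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # 1 insertion / 3 words
```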
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang · Oct 29, 2025

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 78% · High protocol signal · Freshness: Cold · Status: Fallback
Tags: Automatic Metrics · Long Horizon · General
  • To address this gap, we introduce the Tool Decathlon (dubbed Toolathlon), a benchmark for language agents offering diverse apps and tools, realistic environment setup, and reliable execution-based evaluation.
  • Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model…
Open paper
Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 56% · Moderate protocol signal · Freshness: Cold · Status: Ready
Tags: Red Team · Automatic Metrics · General
  • Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup.
  • By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks, providing up to a 20% reduction…
Open paper
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation

Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang, Huang Huang · Oct 21, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 56% · Moderate protocol signal · Freshness: Cold · Status: Ready
Tags: Demonstrations · Simulation Env · Long Horizon · General
  • Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
  • This challenge intensifies for multi-step bimanual mobile manipulation, where humans must teleoperate both the mobile base and two high-DoF arms.
Open paper

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 52% · Moderate protocol signal · Freshness: Cold · Status: Ready
Tags: Automatic Metrics · General
  • DELULU significantly outperforms prior SSL models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks…
Open paper
FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution

Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni · Oct 18, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 52% · Moderate protocol signal · Freshness: Cold · Status: Ready
Tags: Automatic Metrics · General
  • Human communication heavily relies on laconism and inferential pragmatics, allowing listeners to successfully reconstruct rich meaning from sparse, telegraphic speech.
Open paper
DropVLA: An Action-Level Backdoor Attack on Vision-Language-Action Models

Zonghuan Xu, Xiang Zheng, Xingjun Ma, Yu-Gang Jiang · Oct 13, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 52% · Moderate protocol signal · Freshness: Cold · Status: Ready
Tags: Automatic Metrics · General
  • The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas text-only largely fails (0.72%).
  • These results reveal that VLA models can be covertly steered at the granularity of safety-critical actions with minimal poisoning and without observable degradation of nominal performance.
Open paper
RePo: Language Models with Context Re-Positioning

Huayang Li, Tianyu Zhao, Deng Cai, Richard Sproat · Dec 16, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 49% · Sparse protocol signal · Freshness: Cold · Status: Ready
Tags: Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Constant-Time Motion Planning with Manipulation Behaviors

Nayesha Gandotra, Itamar Mishani, Maxim Likhachev · Nov 30, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 49% · Sparse protocol signal · Freshness: Cold · Status: Ready
Tags: Simulation Env · General
  • One of the key barriers is the lack of motion planning algorithms that can provide verifiable guarantees for safety, efficiency and reliability.
Open paper
Happiness is Sharing a Vocabulary: A Study of Transliteration Methods

Haeji Jung, Jinju Kim, Kyungjin Kim, Youjeong Roh, David R. Mortensen · Oct 12, 2025

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 46% · Sparse protocol signal · Freshness: Cold · Status: Ready
Tags: Multilingual
  • We evaluate each model on three downstream tasks -- named entity recognition (NER), part-of-speech tagging (POS) and natural language inference (NLI) -- and find that romanization significantly outperforms other input types in 11 out of 12…
Open paper
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri, Trizal Garg · Dec 26, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% · High protocol signal · Freshness: Warm · Status: Ready
Tags: Expert Verification · Automatic Metrics · Coding · Multilingual
  • To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
  • We evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models, using a strict evaluation protocol.
Open paper

Match reason: Matched by broad semantic/index fallback.

Score: 28% · Sparse protocol signal · Freshness: Warm · Status: Ready
Tags: General
  • However, their reliance on large-scale datasets-often collected from human or web sources-makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors.
Open paper
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

Heecheol Yun, Kwangmin Ki, Junghyun Lee, Eunho Yang · Oct 17, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 33% · Moderate protocol signal · Freshness: Cold · Status: Ready
Tags: Automatic Metrics · Math
  • Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
Open paper
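
The SAFE result above (gains while ensembling under 1% of tokens) rests on deciding, per decoding step, whether ensembling is worth the cost. Below is a hedged sketch of one plausible trigger, entropy gating on the base model's next-token distribution; the paper's actual stability criterion is more involved, and token-level averaging additionally assumes the models share a vocabulary.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()

def should_ensemble(base_logits: np.ndarray, threshold: float = 2.0) -> bool:
    """Flag a decoding step for ensembling when the base model is uncertain."""
    p = softmax(base_logits)
    entropy = -np.sum(p * np.log(p + 1e-12))
    return entropy > threshold

def decode_step(base_logits: np.ndarray, helper_logits: np.ndarray) -> int:
    """Average distributions only at flagged steps; decode cheaply elsewhere.

    Assumes both models share one vocabulary so logits are index-aligned.
    """
    if should_ensemble(base_logits):
        merged = 0.5 * softmax(base_logits) + 0.5 * softmax(helper_logits)
        return int(np.argmax(merged))
    return int(np.argmax(base_logits))  # greedy step from the base model alone
```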
Action Without Interaction: Probing the Physical Foundations of Video LMMs via Contact-Release Detection

Daniel Harari, Michael Sidorov, Chen Shterental, Liel David, Abrham Kahsay Gebreselasie, Muhammad Haris Khan · Nov 25, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% · Sparse protocol signal · Freshness: Cold · Status: Ready
Tags: General
  • 250 Amazon Mechanical Turk annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release').
Open paper
Verifying Chain-of-Thought Reasoning via Its Computational Graph

Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda · Oct 10, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 23% · Sparse protocol signal · Freshness: Cold · Status: Ready
Tags: General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service · Post a Job: Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Managed Service · Done-for-You: We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

For Freelancers · Join as an AI Trainer: Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.