Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 268 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models

Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng, Kyle Lam · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Expert Verification Automatic Metrics MedicineCoding
  • Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
  • We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case.
Open paper
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao, Yibing Fu · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Expert Verification Automatic Metrics MedicineCoding
  • Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
  • We introduce ResGo, a benchmark of laparoscopic frames annotated with Go Zone bounding boxes and clinician-authored rationales covering phase, exposure quality reasoning, next action and risk reminder.
Open paper
MoDora: Tree-Based Semi-Structured Document Analysis System

Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He, Shihan Yu · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song, Kenji Kawaguchi · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics MathMedicine
  • Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways,…
  • Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance, improving accuracy by up to +13 points on AIME25 and +5 points on Apex…
Open paper
Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models

Craig Myles, Patrick Schrempf, David Harris-Birtill · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics MedicineCoding
  • We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical…
Open paper
DySCO: Dynamic Attention-Scaling Decoding for Long-Context Language Models

Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Across multiple instruction-tuned and reasoning models, DYSCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with…
Open paper
MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

Kazi Samin Yasar Alam, Md Tanbir Chowdhury, Tamim Ahmed, Ajwad Abrar, Md Rafid Haque · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • We construct the corpus through targeted social media collection, systematic filtering, and multi-annotator validation.
  • We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting.
Open paper
One Brain, Omni Modalities: Towards Unified Non-Invasive Brain Decoding with Large Language Models

Changli Tang, Shurui Li, Junliang Wang, Qinfan Xiao, Zhonghao Zhai, Lei Bai · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Extensive evaluations demonstrate that NOBEL serves as a robust generalist across standard single-modal tasks.
Open paper
XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence

Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser · Feb 24, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics MedicineCoding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
SpecMind: Cognitively Inspired, Interactive Multi-Turn Framework for Postcondition Inference

Cuong Chi Le, Minh V. T Pham, Tung Vu Duy, Cuong Duc Van, Huy N. Phan, Hoang N. Phan · Feb 24, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Our empirical evaluation shows that SpecMind significantly outperforms state-of-the-art approaches in both accuracy and completeness of generated postconditions.
Open paper
DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang · Feb 23, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR.
Open paper
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng, Ismail Elezi · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon MathCoding
  • This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
  • Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate…
Open paper
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Coding
  • Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
  • Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion.
Open paper
Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics

Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li, Gang Yang · Feb 23, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon MedicineCoding
  • The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
Open paper
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li · Feb 23, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Coding
  • We introduce (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.
Open paper

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon MedicineCoding
  • With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.
  • To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step-by-step validation.
Open paper
InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Math
  • Our evaluation experiments on Llama models shows that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.