
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 479 · Search mode: keyword

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to explore deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.


Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

Youliang Yuan, Qiuyang Mang, Jingbang Chen, Hong Wan, Xiaoyuan Liu, Junjielong Xu · Oct 9, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% · High protocol signal · Freshness: Cold · Status: Ready
Rubric Rating · Automatic Metrics · Long Horizon · Math · Law
  • Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps: abrupt jumps to a correct output without a valid preceding derivation.
  • When integrated into an RL pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks.
Open paper
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary · Oct 5, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 78% · High protocol signal · Freshness: Cold · Status: Ready
Rubric Rating · Automatic Metrics · Simulation Env · Coding
  • We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials (avg@N) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and…
  • Together, these results recommend replacing Pass@k for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit.
Open paper
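The posterior-based protocol summarized above can be sketched with a conjugate Beta model. This is a minimal illustration, not the paper's exact method: the uniform Beta(1, 1) prior and the function names are assumptions for the sketch.

```python
# Contrast the standard Pass@k point estimator with a Bayesian posterior
# over a model's per-task success probability, in the spirit of the
# abstract above. Prior choice and naming are illustrative assumptions.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: chance that at least one of k samples, drawn
    without replacement from n trials with c successes, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def beta_posterior(n: int, c: int, a: float = 1.0, b: float = 1.0):
    """Conjugate update: a Beta(a, b) prior plus c successes in n trials
    yields a Beta(a + c, b + n - c) posterior; also return its mean."""
    a_post, b_post = a + c, b + n - c
    return a_post, b_post, a_post / (a_post + b_post)


if __name__ == "__main__":
    n, c = 10, 3  # 3 correct answers out of 10 trials on one task
    print(f"pass@1 point estimate: {pass_at_k(n, c, 1):.3f}")    # 0.300
    a, b, mean = beta_posterior(n, c)
    print(f"posterior Beta({a:.0f}, {b:.0f}), mean {mean:.3f}")  # mean 0.333
```

Unlike the point estimate, the posterior carries explicit uncertainty: a credible interval can be read off Beta(a + c, b + n - c) (e.g. via `scipy.stats.beta.ppf`), which is what makes rankings stable at small n.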
SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models

Ziyi Yang, Weizhou Shen, Chenliang Li, Ruijun Chen, Fanqi Wan, Ming Yan · Sep 28, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · Coding
  • This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals.
  • Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data.
Open paper
ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim, Jaeyeon Kim · Sep 26, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · Coding
  • For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup.
Open paper
A Linguistics-Aware LLM Watermarking via Syntactic Predictability

Shinwoo Park, Hyejin Park, Hyeseon An, Yo-Sub Han · Oct 10, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 68% · Sparse protocol signal · Freshness: Cold · Status: Ready
Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Polychromic Objectives for Reinforcement Learning

Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh · Sep 29, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 68% · Sparse protocol signal · Freshness: Cold · Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen · Sep 29, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 68% · Sparse protocol signal · Freshness: Cold · Status: Ready
General
  • Extensive experiments on V2S, VisualTTS and joint generation benchmarks show that VSSFlow effectively unifies these tasks and surpasses state-of-the-art domain-specific baselines, underscoring the critical potential of unified generative…
Open paper
AutoEP: LLMs-Driven Automation of Hyperparameter Evolution for Metaheuristic Algorithms

Zhenxing Xu, Yizhe Zhang, Weidong Bao, Hao Wang, Ming Chen, Haoran Ye · Sep 27, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 68% · Sparse protocol signal · Freshness: Cold · Status: Ready
Coding
  • Evaluated on three distinct metaheuristics across diverse combinatorial optimization benchmarks, AutoEP consistently outperforms state-of-the-art tuners, including neural evolution and other LLM-based methods.
Open paper
Fine-tuning Done Right in Model Editing

Wanli Yang, Rui Tang, Hongyu Zang, Du Su, Qi Cao, Jingang Wang · Sep 26, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 68% · Sparse protocol signal · Freshness: Cold · Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu · Oct 7, 2025

Citations: 0

Match reason: Keyword overlap 1/1 across title and protocol fields.

Score: 75% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Human Eval · Automatic Metrics · General
  • Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).
Open paper
Native Hybrid Attention for Efficient Sequence Modeling

Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, Yu Cheng · Oct 8, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru · Oct 6, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · General
  • We introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation.
  • Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly…
Open paper
Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct

Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu, Weijian Luo · Sep 29, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% · Moderate protocol signal · Freshness: Cold · Status: Ready
Automatic Metrics · General
  • On the OpenWebText benchmark, DiDi-Instruct achieves perplexity ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline.
  • We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, downstream task evaluations, and unconditional protein sequence generation.
Open paper
PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space

Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Zitong Wang, Ziwei He · Sep 27, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 26% · Sparse protocol signal · Freshness: Cold · Status: Ready
Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch

Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi, Hongzhi Li · Oct 8, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 30% · Moderate protocol signal · Freshness: Cold · Status: Fallback
Pairwise Preference · Coding
  • Despite its small size, Llama-3-8B-Base fine-tuned on PiKa-SFT even outperforms the official Llama-3-8B-Instruct model, trained on over 10M proprietary examples, on widely used benchmarks such as AlpacaEval 2.0 and Arena-Hard.
  • Additionally, we provide 30k high-quality preference optimization examples to further enhance alignment.
Open paper
Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan · Sep 28, 2025

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 26% · Sparse protocol signal · Freshness: Cold · Status: Fallback
Pairwise Preference · General
  • These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning.
  • We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined.
Open paper

Protocol Hubs

Benchmark Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.