Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 664 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

DySCO: Dynamic Attention-Scaling Decoding for Long-Context Language Models

Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Across multiple instruction-tuned and reasoning models, DYSCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with…
Open paper
OmniGAIA: Towards Native Omni-Modal AI Agents

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
Tool Use General
  • To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities.
  • Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under tool-integrated reasoning paradigm with active omni-modal perception.
Open paper
RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format

Zhehao Huang, Yuhang Liu, Baijiong Lin, Yixin Lou, Zhengbao He, Hanling Tian · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Across four instruction-following benchmarks and nine reasoning & general capability benchmarks, RAIN-Merging substantially improves instruction adherence while maintaining reasoning quality.
  • The gains are consistent across model scales and architectures, translating to improved performance in agent settings.
Open paper
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference

Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Tool Use General
  • Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20\% to 40\%.
Open paper
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad, Nick Rahimi · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohens kappa, and AUC-ROC.
Open paper
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus

Anna Van Elst, Kerrian Le Caillec, Igor Colin, Stephan Clémençon · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 57% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Multi Agent General
  • The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical…
  • peer-to-peer networks, IoT, multi-agent systems), extending the ability to calculate consensus rankings with guarantees in a decentralized setting, i.e., when preference data is initially distributed across a communicating network, remains…
Open paper
IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages

Thanmay Jayakumar, Mohammed Safi Ur Rahman Khan, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 54% Sparse protocol signal Freshness: Warm Status: Ready
Multilingual
  • Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers.
  • We introduce IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions.
Open paper
Human Label Variation in Implicit Discourse Relation Recognition

Frances Yung, Daniil Ignatev, Merel Scholman, Vera Demberg, Massimo Poesio · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 51% Sparse protocol signal Freshness: Warm Status: Ready
General
  • There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives.
  • To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators.
Open paper
TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion

Donghong Cai, Jiarui Feng, Yanbo Wang, Da Zheng, Yixin Chen, Muhan Zhang · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 51% Sparse protocol signal Freshness: Warm Status: Ready
Medicine
  • Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.
Open paper
Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration: A Hybrid Knowledge-Data-Driven Approach

Xu Yang, Chenhui Lin, Xiang Ma, Dong Liu, Ran Zheng, Haotian Liu · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 51% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Considering the operational scenarios and requirements in real-world ADNs, in this paper, we propose a hybrid knowledge-data-driven approach that leverages dynamic collaboration between a large language model (LLM) agent and a reinforcement…
  • In the day-ahead stage, the LLM agent receives coarse region-level forecasts and generates scheduling strategies for on-load tap changer (OLTC) and shunt capacitors (SCs) to regulate the overall voltage profile.
Open paper

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 54% Sparse protocol signal Freshness: Warm Status: Fallback
Critique Edit Coding
  • NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
  • All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol.
Open paper
Probing for Knowledge Attribution in Large Language Models

Ivo Brink, Alexander Boer, Dennis Ulmer · Feb 26, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without…
Open paper
SumTablets: A Transliteration Dataset of Sumerian Tablets

Cole Simmons, Richard Diehl Martinez, Dan Jurafsky · Feb 25, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready
Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning

Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson, Kenny Smith · Feb 25, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular.
  • Adopting methods from the Reinforcement Learning literature, we confirm that highly regular human(-like) systems are easier to learn than unattested but possible irregular systems.
Open paper
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li, Shujian Huang · Feb 25, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Multilingual
  • Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged…
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.