A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total…
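The excerpt omits the method itself, but the headline comparison (a scaling law fitted on roughly 10% of runs versus the full experimental set) is easy to picture. A minimal sketch, assuming a saturating power-law form and synthetic data; the functional form and every name here are illustrative assumptions, not the paper's method:

```python
# Minimal sketch: fit L(N) = a * N**(-b) + c on ~10% of runs vs. the full set.
# Synthetic data and the power-law form are assumptions for illustration.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    return a * n ** (-b) + c

rng = np.random.default_rng(0)
sizes = np.logspace(6, 10, 200)                    # 200 hypothetical runs
losses = scaling_law(sizes, 5.0, 0.3, 1.2)
losses += rng.normal(0.0, 0.02, sizes.shape)       # observation noise

subset = rng.choice(sizes.size, size=20, replace=False)   # ~10% of runs
full_fit, _ = curve_fit(scaling_law, sizes, losses, p0=(1.0, 0.5, 1.0))
sub_fit, _ = curve_fit(scaling_law, sizes[subset], losses[subset],
                       p0=(1.0, 0.5, 1.0))
print("full-set fit:", full_fit)   # the subset fit is often close to this
print("subset fit:  ", sub_fit)
```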
Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term "world model" carries different meanings across research communities.
Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude…
Evaluation across 8,276 breaths demonstrates high reconstruction accuracy (mean squared error < 0.001 for four-component models) and robust parameter precision under moderate noise.
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to seven distinct types of visual clinical evidence (CE) per case.
Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages.
We synthesize recent findings indicating that (i) safety guardrails weaken sharply on low-resource and code-mixed inputs, (ii) culturally harmful behavior can persist even when standard toxicity scores look acceptable, and (iii)…
The methodological trajectory moves from classical supervised adaptation for task-specific demands, to decoding-time alignment for safety, and finally to human feedback and preference modeling for sociolinguistic acuity.
Trained on 32.7 million triplet samples drawn from 67 million toponyms spanning GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names, the Student achieves the highest Recall@1 (85.2%) and Mean Reciprocal Rank (90.8%) on the…
The approach naturally handles pre-standardisation orthographic variation characteristic of historical documents, and transfers effectively to personal names in archival sources, suggesting broad applicability to name resolution tasks in…
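For readers wanting a concrete picture of triplet training for name matching, below is a minimal sketch assuming a character-level encoder. The architecture, hyperparameters, and names are illustrative stand-ins, not the paper's actual Student model:

```python
# Sketch of triplet training for toponym matching with a character-level
# encoder; all choices here are illustrative assumptions.
import torch
import torch.nn as nn

class ToponymEncoder(nn.Module):
    def __init__(self, vocab_size=256, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids):              # (batch, seq_len) byte ids
        h, _ = self.rnn(self.emb(char_ids))
        return nn.functional.normalize(h.mean(dim=1), dim=-1)

def to_ids(name, max_len=32):
    ids = list(name.encode("utf-8"))[:max_len]
    return torch.tensor(ids + [0] * (max_len - len(ids)))

encoder = ToponymEncoder()
loss_fn = nn.TripletMarginLoss(margin=0.2)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# One illustrative (anchor, positive, negative) triplet.
anchor, pos, neg = (to_ids(s).unsqueeze(0)
                    for s in ("Köln", "Cologne", "Kolding"))
loss = loss_fn(encoder(anchor), encoder(pos), encoder(neg))
opt.zero_grad()
loss.backward()
opt.step()
```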
Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance.
Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than…
We systematically evaluate several open- and closed-weight RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task.
Extensive experiments on translation benchmarks show that SAMAS achieves semantic accuracy competitive with strong baselines while holding a statistically significant advantage in style fidelity.
Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers. To operationalize this view, we introduce an evaluation…
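The bottleneck view suggests a simple scoring recipe: reward explanations that are sufficient (the answer is recoverable from the explanation alone) yet compressed (short relative to the input). A hedged sketch of such a score; `query_model` is a hypothetical LLM call, not the paper's evaluation:

```python
# `query_model` is a hypothetical stand-in; plug in any LLM client.
def query_model(prompt: str) -> str:
    raise NotImplementedError("insert your LLM client here")

def ib_style_score(question: str, context: str, explanation: str,
                   gold: str, alpha: float = 0.1) -> float:
    # Sufficiency: can the answer be produced from the explanation alone?
    pred = query_model(f"{question}\nEvidence: {explanation}\nAnswer:")
    sufficiency = float(pred.strip() == gold)
    # Compression: how much of the original context was discarded?
    compression = len(explanation) / max(len(context), 1)
    return sufficiency - alpha * compression
```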
On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling.
To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of LLMs in Japanese healthcare.
Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability.
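A dual-LLM protocol of this kind can be sketched as two independent judges whose verdicts must agree before a case is scored, with disagreements flagged for review. The judge clients and the 1-5 scale below are hypothetical placeholders, not the paper's exact setup:

```python
# Two independent judges rate a dialogue; a verdict is issued only when
# their thresholded decisions agree. Judge stubs are placeholders.
def judge_a(dialogue: str) -> int: ...   # returns a 1-5 safety score
def judge_b(dialogue: str) -> int: ...

def dual_llm_score(dialogue: str, threshold: int = 4) -> dict:
    a, b = judge_a(dialogue), judge_b(dialogue)
    if (a >= threshold) == (b >= threshold):
        return {"safe": a >= threshold, "scores": (a, b)}
    return {"safe": None, "scores": (a, b), "needs_review": True}
```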
Analysis shows that, by enabling on-policy thinking-language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preferences and improves the RL exploitation outcome with leveraged…
Safety alignment of large language models (LLMs) is mostly evaluated in English and in contract-bound (structured JSON) settings, leaving multilingual vulnerabilities understudied.
We introduce Indic Jailbreak Robustness (IJR), a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 billion speakers), covering 45,216 prompts in JSON (contract-bound) and Free (naturalistic) tracks.
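"Judge-free" here plausibly means scores are computed mechanically rather than by an LLM judge. A sketch under that assumption: the contract-bound track checks a fixed JSON schema, while the naturalistic track falls back to refusal-marker matching; the schema and marker list are illustrative, not the benchmark's actual rubric:

```python
# Judge-free scoring sketch for the two tracks; schema and markers are
# assumptions for illustration.
import json

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def score_contract(output: str) -> bool | None:
    """True = refused (safe), False = complied, None = broken contract."""
    try:
        verdict = json.loads(output)["action"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    return verdict == "refuse"

def score_free(output: str) -> bool:
    return any(m in output.lower() for m in REFUSAL_MARKERS)
```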
LLM-based agents execute real-world workflows via tools and memory.
We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive…
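The probing loop can be pictured as: decompose the illicit goal into benign-looking steps under a persona, then query the agent one step at a time, rephrasing when a step is refused. A sketch with hypothetical stand-ins for the planner, target agent, and refusal check, not STING's actual interfaces:

```python
# Sequential multi-step probing in the spirit of STING; all four helper
# functions are hypothetical placeholders.
def plan_steps(goal: str, persona: str) -> list[str]: ...
def target_agent(message: str) -> str: ...
def is_refusal(reply: str) -> bool: ...
def rephrase(step: str, reply: str) -> str: ...

def probe(goal: str, persona: str, max_retries: int = 2) -> list[str]:
    transcript = []
    for step in plan_steps(goal, persona):
        for _ in range(max_retries + 1):
            reply = target_agent(step)
            transcript.append(reply)
            if not is_refusal(reply):
                break
            step = rephrase(step, reply)   # adaptive re-probing
    return transcript
```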
The dataset, covering 34 languages and over ten thousand items, enables comparative analysis of idiomatic patterns across language-specific realisations and preferences, yielding insights into shared cultural aspects.
The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.
To systematically investigate the reliability of MT metrics, we first curate MENT, a meta-evaluation dataset focused on non-literal translations.
To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework centered on a reflective Core Agent that dynamically invokes specialized sub-agents.
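One way to picture such a framework: a core agent routes each translation pair to a subset of specialized sub-agents, then reflects over their findings to produce a score. The agent names and routing heuristic below are illustrative assumptions, not RATE's actual design:

```python
# Hedged sketch of an agentic evaluation loop: dynamic, input-dependent
# sub-agent invocation plus reflective aggregation. All stubs are
# hypothetical placeholders.
def adequacy_agent(src: str, hyp: str) -> str: ...
def fluency_agent(src: str, hyp: str) -> str: ...
def idiom_agent(src: str, hyp: str) -> str: ...
def looks_non_literal(src: str, hyp: str) -> bool: ...
def reflect(findings: dict[str, str]) -> float: ...   # aggregate to a score

def agentic_eval(src: str, hyp: str) -> float:
    sub_agents = {"adequacy": adequacy_agent, "fluency": fluency_agent}
    if looks_non_literal(src, hyp):      # invoke extra expertise on demand
        sub_agents["idiom"] = idiom_agent
    findings = {name: agent(src, hyp) for name, agent in sub_agents.items()}
    return reflect(findings)
```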
To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds.
Building on this insight, we introduce DELTA (DEbiased Language preference-guided Text Augmentation), a lightweight and efficient mRAG framework that strategically leverages monolingual alignment to optimize cross-lingual retrieval and…
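A calibrated metric in this spirit can be sketched as subtracting a structural baseline (preference measured on control pairs where no genuine relevance difference exists) from the raw preference rate. The estimator and numbers below are assumptions for illustration, not DeLP's exact formulation:

```python
# Illustrative debiasing: raw preference rate minus a structural-confound
# baseline estimated on content-matched control pairs.
def debiased_preference(raw_wins: int, raw_total: int,
                        control_wins: int, control_total: int) -> float:
    raw_rate = raw_wins / raw_total
    structural_rate = control_wins / control_total  # confound estimate
    return raw_rate - structural_rate               # calibrated preference

# e.g. English passages win 70% of retrieval comparisons overall, but 15
# points of that are explained by formatting/length confounds alone:
print(debiased_preference(700, 1000, 150, 1000))    # -> 0.55
```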