
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 57 · Search mode: keyword

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.


Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.


Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan, Zhitao Zeng · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Red Team · Automatic Metrics · Multilingual
  • Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs.
  • To enhance readability and evaluation accuracy, we further design a classical Chinese to English translation module.
Open paper
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models

Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng, Kyle Lam · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Expert Verification · Automatic Metrics · Medicine · Coding
  • Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
  • We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case.
Open paper
Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Automatic Metrics · Coding · Multilingual
  • Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages.
  • We synthesize recent findings indicating that (i) safety guardrails weaken sharply on low-resource and code-mixed inputs, (ii) culturally harmful behavior can persist even when standard toxicity scores look acceptable, and (iii)…
Open paper

Match reason: Matches selected tags (Multilingual).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Long Horizon · Multilingual
  • The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.
Open paper

Match reason: Matches selected tags (Multilingual).

Score: 58% · High protocol signal · Freshness: Warm · Status: Ready
Expert Verification · Automatic Metrics · Multilingual
  • Trained on 32.7 million triplet samples drawn from 67 million toponyms spanning GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names, the Student achieves the highest Recall@1 (85.2%) and Mean Reciprocal Rank (90.8%) on the…
  • The approach naturally handles pre-standardisation orthographic variation characteristic of historical documents, and transfers effectively to personal names in archival sources, suggesting broad applicability to name resolution tasks in…
Open paper

Match reason: Matches selected tags (Multilingual).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Ready
Pairwise Preference · Automatic Metrics · Multilingual
  • Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance.
  • Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than…
Open paper
Unlocking Reasoning Capability on Machine Translation in Large Language Models

Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio, Tom Kocmi · Feb 16, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Ready
Critique Edit · Long Horizon · Math · Coding
  • We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
Open paper
SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation

Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Automatic Metrics · Multi Agent · Multilingual
  • To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task.
  • Extensive experiments on translation benchmarks show that SAMAS achieves competitive semantic accuracy against strong baselines, primarily by leveraging its statistically significant advantage in style fidelity.
Open paper

Match reason: Matches selected tags (Multilingual).

Score: 58% · High protocol signal · Freshness: Warm · Status: Fallback
Automatic Metrics · Long Horizon · Multilingual
  • Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers. To operationalize this view, we introduce an evaluation…
Open paper
Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 58% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Automatic Metrics · Tool Use · Multilingual
  • On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling.
Open paper
JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou · Jan 4, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 55% · Moderate protocol signal · Freshness: Warm · Status: Fallback
Red Team · Medicine · Multilingual
  • To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety of LLMs for Japanese healthcare.
  • Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability.
Open paper
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li, Shujian Huang · Feb 25, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Pairwise Preference · Multilingual
  • Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged…
Open paper
IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

Priyaranjan Pattnayak, Sanchari Chowdhuri · Feb 18, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Red Team · Coding · Multilingual
  • Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied.
  • We introduce Indic Jailbreak Robustness (IJR), a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 billion speakers), covering 45,216 prompts in JSON (contract-bound) and Free (naturalistic) tracks.
Open paper
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment

Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Pairwise Preference · Multilingual
  • The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
  • In this work, we propose a resource-efficient method for improving multilingual safety alignment.
Open paper
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Red Team · Law · Multilingual
  • LLM-based agents execute real-world workflows via tools and memory.
  • We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive…
Open paper
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding

Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit, Thomas Pickard · Jan 13, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Pairwise Preference · Multilingual
  • The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
  • The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.
Open paper
Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation

Yanzhi Tian, Cunxiang Wang, Zeming Liu, Heyan Huang, Wenbo Yu, Dawei Song · Jan 12, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
LLM As Judge · Automatic Metrics · Coding · Multilingual
  • To systematically investigate the reliability of MT metrics, we first curate a meta-evaluation dataset focused on non-literal translations, namely MENT.
  • To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework, centered by a reflective Core Agent that dynamically invokes specialized sub-agents.
Open paper
Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion

Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee · Jan 6, 2026

Citations: 0

Match reason: Matches selected tags (Multilingual).

Score: 52% · Sparse protocol signal · Freshness: Warm · Status: Fallback
Pairwise Preference · Multilingual
  • To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds.
  • Building on this insight, we introduce DELTA (DEbiased Language preference-guided Text Augmentation), a lightweight and efficient mRAG framework that strategically leverages monolingual alignment to optimize cross-lingual retrieval and…
Open paper