HFEPX Hub

Multilingual + Pairwise Preference Papers

Updated from current HFEPX corpus (Apr 27, 2026). 21 papers are grouped in this hub page.


Common evaluation modes: Automatic Metrics, LLM-as-Judge. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Apr 2, 2026.

Papers: 21 · Last published: Apr 2, 2026

Multilingual · Pairwise Preference

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.


High-Signal Coverage

100.0%

21 / 21 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

0 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
2 papers report explicit quality controls (calibration/adjudication/IAA).
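The two triage rules above (replication-ready means benchmark, metric, and evaluation mode are all present; judge/human comparable means both `human_eval` and `llm_as_judge` are present) can be sketched as simple predicates. This is a minimal illustration, not the hub's actual implementation: the `PaperRecord` shape, its field names, and the `automatic_metrics` tag are assumptions, and only three of the 21 papers are transcribed here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class PaperRecord:
    """Hypothetical record shape; field names are assumptions, not the HFEPX schema."""
    title: str
    eval_modes: FrozenSet[str] = frozenset()
    benchmark: Optional[str] = None
    metric: Optional[str] = None


def is_replication_ready(paper: PaperRecord) -> bool:
    # Benchmark + metric + at least one explicit evaluation mode.
    return bool(paper.benchmark) and bool(paper.metric) and bool(paper.eval_modes)


def supports_judge_human(paper: PaperRecord) -> bool:
    # Both a human evaluation and an LLM judge must be present.
    return {"human_eval", "llm_as_judge"} <= paper.eval_modes


# Three entries transcribed from this hub's listing (benchmark is Not Reported for all).
papers = [
    PaperRecord("MENLO", frozenset({"automatic_metrics"}), None, "agreement"),
    PaperRecord("Blinded Radiologist and LLM-Based Evaluation",
                frozenset({"llm_as_judge", "automatic_metrics"}), None, "accuracy"),
    PaperRecord("RAPTOR"),
]

ready = [p.title for p in papers if is_replication_ready(p)]
comparable = [p.title for p in papers if supports_judge_human(p)]
print(len(ready), len(comparable))
```

Under these assumptions, none of the three sampled records passes either filter, because no benchmark is recorded and no record carries `human_eval`.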

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.


Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 33.3% of papers in this hub.
Long-horizon tasks appear in 9.5% of papers, indicating agentic evaluation demand.

Protocol Takeaways

The most common quality-control signal is rater calibration (4.8% of papers).
Raters are mostly domain experts and the usual annotation unit is a pairwise comparison; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Metric Interpretation

Accuracy is reported in 14.3% of hub papers (3/21); compare with a secondary metric before ranking methods.
Agreement is reported in 9.5% of hub papers (2/21); compare with a secondary metric before ranking methods.
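The percentages above are plain shares of the 21 sampled papers, rounded to one decimal place. A minimal sketch of that arithmetic follows; the helper name `share_pct` is hypothetical.

```python
def share_pct(count: int, total: int) -> float:
    """Share of `count` in `total`, as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Counts taken from this hub page: accuracy 3/21, agreement 2/21.
for metric, count in {"accuracy": 3, "agreement": 2}.items():
    print(f"{metric}: {share_pct(count, 21)}%")
```

With only 21 papers, each paper moves a share by about 4.8 points, so small differences between metrics should not be over-read.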

Researcher Checklist

Strong: Papers with explicit human feedback
Coverage is strong (100% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (9.5% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (0% vs 35% target).

Moderate: Papers naming evaluation metrics
Coverage is usable but incomplete (33.3% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (9.5% vs 35% target).

Strong: Papers with known annotation unit
Coverage is strong (42.9% vs 35% target).
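The checklist labels compare each coverage figure with a target. The page does not state where the Moderate band begins, so the sketch below assumes a cutoff at 90% of the target (`moderate_ratio`), chosen only because it reproduces the six labels shown here; the function name and the cutoff are assumptions.

```python
def coverage_band(coverage_pct: float, target_pct: float,
                  moderate_ratio: float = 0.9) -> str:
    """Label coverage against a target.

    moderate_ratio is an assumed cutoff, not a documented HFEPX threshold.
    """
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (coverage %, target %) pairs as listed in the checklist.
checklist = {
    "explicit human feedback": (100.0, 45.0),
    "quality controls": (9.5, 30.0),
    "benchmarks/datasets": (0.0, 35.0),
    "evaluation metrics": (33.3, 35.0),
    "known rater population": (9.5, 35.0),
    "known annotation unit": (42.9, 35.0),
}

for item, (coverage, target) in checklist.items():
    print(f"{coverage_band(coverage, target)}: {item}")
```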

Strengths

Strong human-feedback signal (100% of papers).

Known Gaps

Only 9.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9.5% coverage).
Benchmark coverage is thin (0% of papers mention benchmarks/datasets).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Track metric sensitivity by reporting both accuracy and agreement.
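One way to report accuracy and agreement side by side for pairwise labels is sketched below. It uses Cohen's kappa as the agreement statistic, which is an assumption: the hub does not say which agreement measure the papers use. The label lists are made-up illustrative data, not results from any listed paper.

```python
from collections import Counter
from typing import Sequence


def accuracy(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of pairwise items where the judge picks the same side as the human."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(judge)
    p_observed = accuracy(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Expected agreement if both raters labelled independently at their own base rates.
    p_expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Illustrative preferences ("A" or "B" wins each pair); not real data.
human_labels = ["A", "A", "B", "B", "A", "B", "A", "A"]
judge_labels = ["A", "A", "B", "A", "A", "A", "A", "A"]

print(f"accuracy={accuracy(judge_labels, human_labels):.3f}")
print(f"kappa={cohens_kappa(judge_labels, human_labels):.3f}")
```

In this toy example the judge matches the human on 6 of 8 pairs (accuracy 0.75), yet kappa is only about 0.385 because the judge almost always picks "A". That gap is the kind of metric sensitivity a single accuracy figure would hide.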

Recommended Queries

LLM-as-Judge Protocols
Metric Slice: accuracy
IAA-Reported Evaluations
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

MENLO: From Preferences to Proficiency -- Evaluating and Modeling Nat…

Highest protocol score with explicit human/eval signal.

Strongest benchmark reference

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanes…

Reports accuracy, which gives a fast comparison anchor; the hub's own tables list its benchmark as Not Reported.

Strongest recent paper

Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Contr…

Useful for current practice scanning; published Mar 6, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Sep 30, 2025 · Citations: 0 · Score: 6.0

HF: Pairwise Preference, Rubric Rating · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Agreement
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Apr 2, 2026 · Citations: 0 · Score: 5.5

HF: Pairwise Preference · Eval: LLM-as-Judge, Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
Mar 6, 2026 · Citations: 0 · Score: 5.5

HF: Pairwise Preference · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported
Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning
Mar 25, 2026 · Citations: 0 · Score: 5.5

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe
Feb 14, 2026 · Citations: 0 · Score: 5.0

HF: Pairwise Preference · Eval: Not reported · Benchmark: Not Reported · Metric: Precision
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Feb 14, 2026 · Citations: 0 · Score: 5.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Toxicity

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Date	HF Signal	Eval Modes	Benchmarks	Metrics	QC
MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages	Sep 30, 2025	Yes: Pairwise Preference, Rubric Rating	Automatic Metrics	Not Reported	Agreement	Inter Annotator Agreement Reported
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study	Apr 2, 2026	Yes: Pairwise Preference	LLM-as-Judge, Automatic Metrics	Not Reported	Accuracy	Not Reported
Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR	Mar 6, 2026	Yes: Pairwise Preference	Not Reported	Not Reported	Not Reported	Calibration
Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning	Mar 25, 2026	Yes: Pairwise Preference	Automatic Metrics	Not Reported	Accuracy	Not Reported
Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe	Feb 14, 2026	Yes: Pairwise Preference	Not Reported	Not Reported	Precision	Not Reported
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages	Feb 14, 2026	Yes: Pairwise Preference	Automatic Metrics	Not Reported	Toxicity	Not Reported
CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models	Jan 8, 2026	Yes: Pairwise Preference	Automatic Metrics	Not Reported	Relevance	Not Reported
MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining	Jul 2, 2025	Yes: Pairwise Preference	Automatic Metrics	Not Reported	Accuracy	Not Reported
Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not	Apr 6, 2026	Yes: Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation	Mar 26, 2026	Yes: Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset	Mar 24, 2026	Yes: Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Gender Bias in MT for a Genderless Language: New Benchmarks for Basque	Mar 9, 2026	Yes: Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	MENLO: From Preferences to Proficiency -- Evaluatin…	Blinded Radiologist and LLM-Based Evaluation of LLM…	Do Compact SSL Backbones Matter for Audio Deepfake…
Human Feedback	Pairwise Preference, Rubric Rating	Pairwise Preference	Pairwise Preference
Evaluation Modes	Automatic Metrics	LLM-as-Judge, Automatic Metrics	Not reported
Benchmarks	Not reported	Not reported	Not reported
Metrics	Agreement	Accuracy	Not reported
Quality Controls	Inter Annotator Agreement Reported	Not reported	Calibration
Rater Population	Unknown	Domain Experts	Unknown
Annotation Unit	Pairwise	Pairwise	Pairwise

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (21)
Rubric Rating (1)

Evaluation Modes

Automatic Metrics (7)
LLM-as-Judge (2)

Top Benchmarks

None reported.

Top Metrics

Accuracy (3)
Agreement (2)
Precision (1)
Relevance (1)

Rater Population Mix

Domain Experts (2)

Quality Controls

Calibration (1)
Inter Annotator Agreement Reported (1)

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 0.0% · metrics 33.3% · quality controls 9.5%.

Top Papers

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka · Apr 2, 2026 · Citations: 0

Pairwise Preference · LLM-as-Judge · Automatic Metrics

A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi · Sep 30, 2025 · Citations: 0

Pairwise Preference · Rubric Rating · Automatic Metrics

To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms.
Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss · Mar 6, 2026 · Citations: 0

Pairwise Preference · Long Horizon

We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from HuBERT and WavLM within a unified pairwise-gated fusion detector, evaluated across 14…
Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe
Somnath Banerjee · Feb 14, 2026 · Citations: 0

Pairwise Preference · Long Horizon

The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.
Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag · Oct 24, 2025 · Citations: 0

Pairwise Preference · LLM-as-Judge

Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking.
Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning
He Huang · Mar 25, 2026 · Citations: 0

Pairwise Preference · Automatic Metrics

We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets.
MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining
Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu · Jul 2, 2025 · Citations: 0

Pairwise Preference · Automatic Metrics

We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages.
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0

Pairwise Preference · Automatic Metrics

Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages.
CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models
Yifan Le, Yunliang Li · Jan 8, 2026 · Citations: 0

Pairwise Preference · Automatic Metrics

Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance.
Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0

Pairwise Preference · Automatic Metrics

Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and…
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026 · Citations: 0

Pairwise Preference

The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit · Jan 13, 2026 · Citations: 0

Pairwise Preference

The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not
Sercan Karakaş · Apr 6, 2026 · Citations: 0

Pairwise Preference

Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution.
Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation
Ying Li, Xinglin Lyu, Junhui Li, Jinlong Yang, Hengchao Shang · Mar 26, 2026 · Citations: 0

Pairwise Preference

In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT.
Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset
Ryoma Suzuki, Zhiyang Qi, Michimasa Inaba · Mar 24, 2026 · Citations: 0

Pairwise Preference

The quality of "Multilingual KokoroChat" was rigorously validated through human preference studies.
Gender Bias in MT for a Genderless Language: New Benchmarks for Basque
Amaia Murillo, Olatz Perez-de-Viñaspre, Naiara Perez · Mar 9, 2026 · Citations: 0

Pairwise Preference

WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French.
EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu · Mar 2, 2026 · Citations: 0

Pairwise Preference

We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior.
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection
Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li · Feb 25, 2026 · Citations: 0

Pairwise Preference

Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged…
Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion
Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee · Jan 6, 2026 · Citations: 0

Pairwise Preference

To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds.
Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages
David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov · Dec 9, 2025 · Citations: 0

Pairwise Preference

Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese.
Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque
Oscar Sainz, Naiara Perez, Julen Etxaniz, Joseba Fernandez de Landa, Itziar Aldabe · Jun 9, 2025 · Citations: 0

Pairwise Preference

We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants.
