HFEPX Hub

Multilingual + Pairwise Preference (Last 120 Days)

Updated from current HFEPX corpus (Apr 27, 2026). 16 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 27, 2026). 16 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Apr 2, 2026.

Papers: 16 Last published: Apr 2, 2026 Global RSS Tag RSS

MultilingualPairwise PreferenceLast 120d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (16) Replication-Ready Only (0)

High-Signal Coverage

100.0%

16 / 16 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

0 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
1 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 31.3% of papers in this hub.
long-horizon tasks appears in 12.5% of papers, indicating agentic evaluation demand.

Protocol Takeaways

Most common quality-control signal is rater calibration (6.3% of papers).
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Metric Interpretation

accuracy is reported in 12.5% of hub papers (2/16); compare with a secondary metric before ranking methods.
agreement is reported in 6.3% of hub papers (1/16); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Strong: Papers with explicit human feedback

Coverage is strong (100% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (6.3% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (0% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (31.3% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (12.5% vs 35% target).
Strong: Papers with known annotation unit

Coverage is strong (37.5% vs 35% target).

Strengths

Strong human-feedback signal (100% of papers).

Known Gaps

Only 6.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (12.5% coverage).
Benchmark coverage is thin (0% of papers mention benchmarks/datasets).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Track metric sensitivity by reporting both accuracy and agreement.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries (Expanded)

Recommended Queries

LLM-as-Judge Protocols Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanes…

Highest protocol score with explicit human/eval signal.

Strongest recent paper

Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Contr…

Useful for current practice scanning; published Mar 6, 2026.

Strongest recent paper

Tutoring Large Language Models to be Domain-adaptive, Precise, and Sa…

Useful for current practice scanning; published Feb 14, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Apr 2, 2026 · Citations: 0 · Score: 6.0

HF: Pairwise Preference · Eval: Llm As Judge, Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
Mar 6, 2026 · Citations: 0 · Score: 5.5

HF: Pairwise Preference · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported
Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe
Feb 14, 2026 · Citations: 0 · Score: 5.5

HF: Pairwise Preference · Eval: Not reported · Benchmark: Not Reported · Metric: Precision
Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning
Mar 25, 2026 · Citations: 0 · Score: 5.5

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Feb 14, 2026 · Citations: 0 · Score: 5.5

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Toxicity
CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models
Jan 8, 2026 · Citations: 0 · Score: 5.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Relevance

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study Apr 2, 2026	Yes Pairwise Preference	Llm As Judge , Automatic Metrics	Not Reported	Accuracy	Not Reported
Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR Mar 6, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Calibration
Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe Feb 14, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Precision	Not Reported
Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning Mar 25, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Accuracy	Not Reported
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages Feb 14, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Toxicity	Not Reported
CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models Jan 8, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Relevance	Not Reported
Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not Apr 6, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Rethinking Metrics for Lexical Semantic Change Detection Feb 17, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Not Reported	Not Reported
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment Feb 18, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation Mar 26, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset Mar 24, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Gender Bias in MT for a Genderless Language: New Benchmarks for Basque Mar 9, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Blinded Radiologist and LLM-Based Evaluation of LLM…	Do Compact SSL Backbones Matter for Audio Deepfake…	Tutoring Large Language Models to be Domain-adaptiv…
Human Feedback	Pairwise Preference	Pairwise Preference	Pairwise Preference
Evaluation Modes	Llm As Judge, Automatic Metrics	Not reported	Not reported
Benchmarks	Not reported	Not reported	Not reported
Metrics	Accuracy	Not reported	Precision
Quality Controls	Not reported	Calibration	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Pairwise	Pairwise	Trajectory

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (16)

Evaluation Modes

Automatic Metrics (5)
Llm As Judge (1)

Top Benchmarks

Top Metrics

Accuracy (2)
Agreement (1)
Precision (1)
Relevance (1)

Rater Population Mix

Domain Experts (2)

Quality Controls

Calibration (1)

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 0.0% · metrics 31.3% · quality controls 6.3%.

Top Papers

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka · Apr 2, 2026 · Citations: 0

Pairwise Preference Llm As JudgeAutomatic Metrics

A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss · Mar 6, 2026 · Citations: 0

Pairwise Preference Long Horizon

We present RAPTOR, Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition a controlled study of compact SSL backbones from the HuBERT and WavLM within a unified pairwise-gated fusion detector, evaluated across 14…
Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe
Somnath Banerjee · Feb 14, 2026 · Citations: 0

Pairwise Preference Long Horizon

The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.
Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning
He Huang · Mar 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets.
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality ''transfer'' across languages.
CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models
Yifan Le, Yunliang Li · Jan 8, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance.
Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026 · Citations: 0

Pairwise Preference

The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit · Jan 13, 2026 · Citations: 0

Pairwise Preference

The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not
Sercan Karakaş · Apr 6, 2026 · Citations: 0

Pairwise Preference

Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution.
Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation
Ying Li, Xinglin Lyu, Junhui Li, Jinlong Yang, Hengchao Shang · Mar 26, 2026 · Citations: 0

Pairwise Preference

In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT.
Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset
Ryoma Suzuki, Zhiyang Qi, Michimasa Inaba · Mar 24, 2026 · Citations: 0

Pairwise Preference

The quality of ``Multilingual KokoroChat'' was rigorously validated through human preference studies.
Gender Bias in MT for a Genderless Language: New Benchmarks for Basque
Amaia Murillo, Olatz-Perez-de-Viñaspre, Naiara Perez · Mar 9, 2026 · Citations: 0

Pairwise Preference

WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French.
EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu · Mar 2, 2026 · Citations: 0

Pairwise Preference

We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior.
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection
Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li · Feb 25, 2026 · Citations: 0

Pairwise Preference

Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged…
Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion
Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee · Jan 6, 2026 · Citations: 0

Pairwise Preference

To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds.

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now