HFEPX Hub

Multilingual Papers (Last 30 Days)

Updated from current HFEPX corpus (Apr 17, 2026). 10 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 17, 2026). 10 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Adjudication. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Apr 2, 2026.

Papers: 10 Last published: Apr 2, 2026 Global RSS Tag RSS

MultilingualLast 30d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (10) Replication-Ready Only (0)

High-Signal Coverage

100.0%

10 / 10 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

0 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
1 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

70% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 60% of papers in this hub.
long-horizon tasks appears in 10% of papers, indicating agentic evaluation demand.

Protocol Takeaways

Most common quality-control signal is adjudication (10% of papers).
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Metric Interpretation

accuracy is reported in 20% of hub papers (2/10); compare with a secondary metric before ranking methods.
agreement is reported in 20% of hub papers (2/10); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Strong: Papers with explicit human feedback

Coverage is strong (70% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (10% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (0% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (60% vs 35% target).
Moderate: Papers with known rater population

Coverage is usable but incomplete (30% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (30% vs 35% target).

Strengths

Strong human-feedback signal (70% of papers).
Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
Benchmark coverage is thin (0% of papers mention benchmarks/datasets).
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Track metric sensitivity by reporting both accuracy and agreement.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries (Expanded)

Recommended Queries

Judge vs Human Agreement Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

A Multi-Stage Validation Framework for Trustworthy Large-scale Clinic…

Highest protocol score with explicit human/eval signal.

Strongest benchmark reference

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanes…

Reported benchmark with accuracy gives a fast comparison anchor.

Strongest recent paper

Semantic Alignment across Ancient Egyptian Language Stages via Normal…

Useful for current practice scanning; published Mar 25, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Apr 7, 2026 · Citations: 0 · Score: 7.5

HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: F1
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Apr 2, 2026 · Citations: 0 · Score: 6.0

HF: Pairwise Preference · Eval: Llm As Judge, Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning
Mar 25, 2026 · Citations: 0 · Score: 6.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
To Write or to Automate Linguistic Prompts, That Is the Question
Mar 26, 2026 · Citations: 0 · Score: 4.5

HF: Expert Verification · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported
Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not
Apr 6, 2026 · Citations: 0 · Score: 4.5

HF: Pairwise Preference · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported
Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation
Mar 26, 2026 · Citations: 0 · Score: 4.5

HF: Pairwise Preference · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models Apr 7, 2026	Yes Expert Verification	Automatic Metrics	Not Reported	F1 , Agreement	Calibration , Adjudication
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study Apr 2, 2026	Yes Pairwise Preference	Llm As Judge , Automatic Metrics	Not Reported	Accuracy	Not Reported
Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning Mar 25, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Accuracy	Not Reported
To Write or to Automate Linguistic Prompts, That Is the Question Mar 26, 2026	Yes Expert Verification	Not Reported	Not Reported	Not Reported	Not Reported
Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not Apr 6, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation Mar 26, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset Mar 24, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs Mar 19, 2026	No Not Reported	Automatic Metrics	Not Reported	F1 , Bleu	Not Reported
Voxtral TTS Mar 26, 2026	No Not Reported	Human Eval , Automatic Metrics	Not Reported	Win rate	Not Reported
Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties Mar 26, 2026	No Not Reported	Human Eval , Automatic Metrics	Not Reported	Bleu	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	A Multi-Stage Validation Framework for Trustworthy…	Blinded Radiologist and LLM-Based Evaluation of LLM…	Semantic Alignment across Ancient Egyptian Language…
Human Feedback	Expert Verification	Pairwise Preference	Pairwise Preference
Evaluation Modes	Automatic Metrics	Llm As Judge, Automatic Metrics	Automatic Metrics
Benchmarks	Not reported	Not reported	Not reported
Metrics	F1, Agreement	Accuracy	Accuracy
Quality Controls	Calibration, Adjudication	Not reported	Not reported
Rater Population	Domain Experts	Domain Experts	Unknown
Annotation Unit	Unknown	Pairwise	Pairwise

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (5)
Expert Verification (2)

Evaluation Modes

Automatic Metrics (6)
Human Eval (2)
Llm As Judge (1)

Top Benchmarks

Top Metrics

Accuracy (2)
Agreement (2)
Bleu (2)
F1 (2)

Rater Population Mix

Domain Experts (3)

Quality Controls

Adjudication (1)
Calibration (1)

Coverage diagnostics (sample-based): human-feedback 70.0% · benchmarks 0.0% · metrics 60.0% · quality controls 10.0%.

Top Papers

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka · Apr 2, 2026 · Citations: 0

Pairwise Preference Llm As JudgeAutomatic Metrics

A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan · Apr 7, 2026 · Citations: 0

Expert Verification Automatic Metrics

Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning
He Huang · Mar 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets.
Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs
Vedant Pandya · Mar 19, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation…
To Write or to Automate Linguistic Prompts, That Is the Question
Marina Sánchez-Torrón, Daria Akselrod, Jason Rauchwerk · Mar 26, 2026 · Citations: 0

Expert Verification

We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model…
Voxtral TTS
Mistral-AI, :, Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg · Mar 26, 2026 · Citations: 0

Human EvalAutomatic Metrics

In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4\% win rate over ElevenLabs Flash v2.5.
Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties
Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi · Mar 26, 2026 · Citations: 0

Human EvalAutomatic Metrics

A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not
Sercan Karakaş · Apr 6, 2026 · Citations: 0

Pairwise Preference

Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution.
Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation
Ying Li, Xinglin Lyu, Junhui Li, Jinlong Yang, Hengchao Shang · Mar 26, 2026 · Citations: 0

Pairwise Preference

In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT.
Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset
Ryoma Suzuki, Zhiyang Qi, Michimasa Inaba · Mar 24, 2026 · Citations: 0

Pairwise Preference

The quality of ``Multilingual KokoroChat'' was rigorously validated through human preference studies.

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now