HFEPX Hub

Automatic Metrics + Multilingual (Last 90 Days)

Updated from current HFEPX corpus (Apr 27, 2026). 18 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Adjudication. Frequently cited benchmark: ARC-Challenge. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Apr 2, 2026.

Papers: 18 · Last published: Apr 2, 2026

Automatic Metrics · Multilingual · Last 90d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

All Sampled Papers (18) · Replication-Ready Only (2)

High-Signal Coverage

100.0%

18 / 18 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

2 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
1 paper reports explicit quality controls (calibration/adjudication/IAA).
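The two counts above can be reproduced with a simple filter. The sketch below is illustrative only: `PaperRecord`, `is_replication_ready`, and `supports_judge_human_comparison` are hypothetical names, not part of any HFEPX API, and the sample rows are a hand-copied subset of the Protocol Matrix on this page with shortened titles.

```python
from dataclasses import dataclass, field


@dataclass
class PaperRecord:
    """Hypothetical record mirroring the protocol fields shown on this page."""
    title: str
    eval_modes: list[str] = field(default_factory=list)
    benchmarks: list[str] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)


def is_replication_ready(paper: PaperRecord) -> bool:
    # Benchmark + metric + evaluation mode must all be explicitly present.
    return bool(paper.benchmarks) and bool(paper.metrics) and bool(paper.eval_modes)


def supports_judge_human_comparison(paper: PaperRecord) -> bool:
    # The same paper must report both human_eval and llm_as_judge.
    return {"human_eval", "llm_as_judge"} <= set(paper.eval_modes)


# Subset of the Protocol Matrix rows (titles shortened for readability).
papers = [
    PaperRecord("LIT-RAGBench", ["llm_as_judge", "automatic_metrics"], ["lit-ragbench"], ["accuracy"]),
    PaperRecord("Sufficiency-Conciseness Trade-off", ["automatic_metrics"], ["ARC-Challenge"], ["accuracy", "conciseness"]),
    PaperRecord("Multi-Stage Validation Framework", ["automatic_metrics"], [], ["f1", "agreement"]),
    PaperRecord("Blinded Radiologist Evaluation", ["llm_as_judge", "automatic_metrics"], [], ["accuracy"]),
    PaperRecord("LLM Translation of Galen", ["human_eval", "automatic_metrics"], [], ["bleu", "rouge"]),
]

ready = [p.title for p in papers if is_replication_ready(p)]
comparable = [p.title for p in papers if supports_judge_human_comparison(p)]
print(ready)
print(comparable)
```

Under these assumptions only the two benchmark-naming papers pass the replication-ready filter, and no row carries both `human_eval` and `llm_as_judge`.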

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters For Eval Research

44.4% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 100% of papers in this hub.
ARC-Challenge is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

Most common quality-control signal is adjudication (5.6% of papers).
Where reported, raters are domain experts (5 papers) and the annotation unit is most often pairwise; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Benchmark Interpretation

ARC-Challenge appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.
LIT-RAGBench appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 55.6% of hub papers (10/18); compare with a secondary metric before ranking methods.
BLEU is reported in 16.7% of hub papers (3/18); compare with a secondary metric before ranking methods.
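The percentages in the benchmark and metric notes follow directly from the raw counts over the 18 hub papers. A minimal sketch of that arithmetic, where `share` and `HUB_SIZE` are illustrative names chosen here rather than anything the page defines:

```python
HUB_SIZE = 18  # papers grouped in this hub


def share(count: int, total: int = HUB_SIZE) -> str:
    """Format count/total as a percentage with one decimal place."""
    return f"{100 * count / total:.1f}%"


# Counts as listed on this page.
counts = {"accuracy": 10, "bleu": 3, "ARC-Challenge": 1, "lit-ragbench": 1}
shares = {name: share(n) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct} ({counts[name]}/{HUB_SIZE})")
```

With only 18 papers, each paper moves a share by about 5.6 points, so single-paper cohorts such as the two benchmarks should be read as counts rather than as trends.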

Researcher Checklist

Moderate: Papers with explicit human feedback
Coverage is usable but incomplete (44.4% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (5.6% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (11.1% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (94.4% vs 35% target).

Moderate: Papers with known rater population
Coverage is usable but incomplete (27.8% vs 35% target).

Moderate: Papers with known annotation unit
Coverage is usable but incomplete (27.8% vs 35% target).
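The page does not state how the Strong / Moderate / Gap labels are assigned. The sketch below shows one rule that happens to reproduce the six labels in this checklist: the 0.75 cutoff and the function name `coverage_band` are assumptions made for illustration, not the hub's documented logic.

```python
def coverage_band(coverage_pct: float, target_pct: float, moderate_ratio: float = 0.75) -> str:
    """Classify coverage against a target.

    ASSUMPTION: the 0.75 cutoff is a guess that fits the labels on this page;
    the actual banding rule is not published here.
    """
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "Strong"
    if ratio >= moderate_ratio:
        return "Moderate"
    return "Gap"


# (coverage %, target %) pairs copied from the checklist.
checklist = {
    "explicit human feedback": (44.4, 45),
    "quality controls": (5.6, 30),
    "benchmarks/datasets": (11.1, 35),
    "evaluation metrics": (94.4, 35),
    "known rater population": (27.8, 35),
    "known annotation unit": (27.8, 35),
}

bands = {item: coverage_band(cov, target) for item, (cov, target) in checklist.items()}
for item, band in bands.items():
    print(f"{band}: {item}")
```

Any cutoff between roughly 0.32 and 0.79 would give the same labels for these six rows, so the exact value should not be read as meaningful.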

Strengths

The hub includes both human-eval (3 papers) and LLM-as-judge (2 papers) protocols, but no single paper reports both, so head-to-head methodology comparison has to be made across papers.
Agentic evaluation appears in 38.9% of papers.

Known Gaps

Only 5.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Benchmark coverage is thin (11.1% of papers mention benchmarks/datasets).
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (ARC-Challenge vs LIT-RAGBench) before comparing methods.
Track metric sensitivity by reporting both accuracy and BLEU.
Add inter-annotator agreement checks when reproducing these protocols.
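For the agreement checks suggested above, Cohen's kappa is one common chance-corrected statistic for two raters, whether both are human or one is an LLM judge. The following is a minimal sketch written from the standard definition; the pairwise-preference labels are made up for illustration and do not come from any paper in this hub.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pairwise preferences ("A" or "B" wins) for eight items.
human = ["A", "A", "B", "B", "A", "B", "A", "B"]
judge = ["A", "A", "B", "A", "A", "B", "B", "B"]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 6 of 8 items (0.75 observed) against 0.5 expected by chance, which gives a kappa of 0.5. Reporting kappa next to raw agreement makes it harder for a skewed label distribution to inflate the apparent agreement.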

Recommended Queries

Judge vs Human Agreement · Benchmark Slice: ARC-Challenge · Metric Slice: accuracy · Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

A Multi-Stage Validation Framework for Trustworthy Large-scale Clinic…

Highest protocol score with explicit human/eval signal.

Strongest benchmark reference

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanes…

Reported benchmark with accuracy gives a fast comparison anchor.

Strongest recent paper

LIT-RAGBench: Benchmarking Generator Capabilities of Large Language M…

Useful for current practice scanning; published Mar 6, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Apr 7, 2026 · Citations: 0 · Score: 7.5
HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: F1

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Apr 2, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference · Eval: LLM-as-Judge, Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Mar 6, 2026 · Citations: 0 · Score: 5.5
HF: Not Reported · Eval: LLM-as-Judge, Automatic Metrics · Benchmark: LIT-RAGBench · Metric: Accuracy

Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning
Mar 25, 2026 · Citations: 0 · Score: 5.5
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic
Mar 9, 2026 · Citations: 0 · Score: 5.5
HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Feb 25, 2026 · Citations: 0 · Score: 5.5
HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models (Apr 7, 2026)	Yes (Expert Verification)	Automatic Metrics	Not Reported	F1, Agreement	Calibration, Adjudication
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study (Apr 2, 2026)	Yes (Pairwise Preference)	LLM-as-Judge, Automatic Metrics	Not Reported	Accuracy	Not Reported
LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation (Mar 6, 2026)	No	LLM-as-Judge, Automatic Metrics	LIT-RAGBench	Accuracy	Not Reported
Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning (Mar 25, 2026)	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Accuracy	Not Reported
A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic (Mar 9, 2026)	Yes (Expert Verification)	Automatic Metrics	Not Reported	Accuracy	Not Reported
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models (Feb 25, 2026)	Yes (Expert Verification)	Automatic Metrics	Not Reported	Accuracy	Not Reported
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search (Feb 26, 2026)	Yes (Red Team)	Automatic Metrics	Not Reported	Accuracy, Conciseness	Not Reported
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages (Feb 14, 2026)	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Toxicity	Not Reported
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective (Feb 15, 2026)	No	Automatic Metrics	ARC-Challenge	Accuracy, Conciseness	Not Reported
Rethinking Metrics for Lexical Semantic Change Detection (Feb 17, 2026)	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Not Reported	Not Reported
Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen (Feb 27, 2026)	No	Human Eval, Automatic Metrics	Not Reported	BLEU, ROUGE	Not Reported
Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs (Mar 19, 2026)	No	Automatic Metrics	Not Reported	F1, BLEU	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	A Multi-Stage Validation Framework for Trustworthy…	Blinded Radiologist and LLM-Based Evaluation of LLM…	LIT-RAGBench: Benchmarking Generator Capabilities o…
Human Feedback	Expert Verification	Pairwise Preference	Not reported
Evaluation Modes	Automatic Metrics	LLM-as-Judge, Automatic Metrics	LLM-as-Judge, Automatic Metrics
Benchmarks	Not reported	Not reported	LIT-RAGBench
Metrics	F1, Agreement	Accuracy	Accuracy
Quality Controls	Calibration, Adjudication	Not reported	Not reported
Rater Population	Domain Experts	Domain Experts	Unknown
Annotation Unit	Unknown	Pairwise	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (4)
Expert Verification (3)
Red Team (1)

Evaluation Modes

Automatic Metrics (18)
Human Eval (3)
LLM-as-Judge (2)

Top Benchmarks

ARC-Challenge (1)
LIT-RAGBench (1)

Top Metrics

Accuracy (10)
BLEU (3)
F1 (3)
Agreement (2)

Rater Population Mix

Domain Experts (5)

Quality Controls

Adjudication (1)
Calibration (1)

Coverage diagnostics (sample-based): human-feedback 44.4% · benchmarks 11.1% · metrics 94.4% · quality controls 5.6%.

Top Papers

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka · Apr 2, 2026 · Citations: 0
Pairwise Preference · LLM-as-Judge · Automatic Metrics
A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.

A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan · Apr 7, 2026 · Citations: 0
Expert Verification · Automatic Metrics
Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.

LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki · Mar 6, 2026 · Citations: 0
LLM-as-Judge · Automatic Metrics · Long Horizon
To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic,…

Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning
He Huang · Mar 25, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets.

A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic
Peter Brodeur, Jacob M. Koshy, Anil Palepu, Khaled Saab, Ava Homiar · Mar 9, 2026 · Citations: 0
Expert Verification · Automatic Metrics
Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight.

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0
Expert Verification · Automatic Metrics
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan · Feb 26, 2026 · Citations: 0
Red Team · Automatic Metrics
Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs.

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages.

Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and…

The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective
Ali Zahedzadeh, Behnam Bahrak · Feb 15, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers. To operationalize this view, we introduce an evaluation…

Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen
James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky · Feb 27, 2026 · Citations: 0
Human Eval · Automatic Metrics
This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose.

Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs
Vedant Pandya · Mar 19, 2026 · Citations: 0
Automatic Metrics · Long Horizon
We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation…

Video-Based Reward Modeling for Computer-Use Agents
Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul · Mar 10, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction.

Voxtral TTS
Mistral-AI, Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg · Mar 26, 2026 · Citations: 0
Human Eval · Automatic Metrics
In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5.

Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties
Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi · Mar 26, 2026 · Citations: 0
Human Eval · Automatic Metrics
A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.

BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
Jason Lucas, Matt Murtagh-White, Adaku Uchendu, Ali Al-Lawati, Michiharu Yamashita · Feb 28, 2026 · Citations: 0
Automatic Metrics · Multi Agent
We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) and LLM-generated…

SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation
Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026 · Citations: 0
Automatic Metrics · Multi Agent
To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task.

EnsembleLink: Accurate Record Linkage Without Training Data
Noah Dasanaike · Jan 29, 2026 · Citations: 0
Automatic Metrics · Tool Use
On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling.
