HFEPX Hub

Automatic Metrics + Multilingual (Last 60 Days)

Updated from current HFEPX corpus (Mar 10, 2026). 10 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Mar 10, 2026). 10 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequently cited benchmark: ARC-Challenge. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 6, 2026.

Papers: 10 Last published: Mar 6, 2026 Global RSS Tag RSS

Automatic MetricsMultilingualLast 60d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (10) Replication-Ready Only (2)

High-Signal Coverage

100.0%

10 / 10 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

2 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

40% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 100% of papers in this hub.
ARC-Challenge is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Benchmark Interpretation

ARC-Challenge appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
lit-ragbench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 60% of hub papers (6/10); compare with a secondary metric before ranking methods.
conciseness is reported in 20% of hub papers (2/10); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Moderate: Papers with explicit human feedback

Coverage is usable but incomplete (40% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (20% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (90% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (20% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (10% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
Agentic evaluation appears in 50% of papers.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (20% coverage).
Annotation unit is under-specified (10% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (ARC-Challenge vs lit-ragbench) before comparing methods.
Track metric sensitivity by reporting both accuracy and conciseness.

Recommended Queries (Expanded)

Recommended Queries

Judge vs Human Agreement Benchmark Slice: ARC-Challenge Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

LIT-RAGBench: Benchmarking Generator Capabilities of Large Language M…

Highest protocol score with explicit human/eval signal plus lit-ragbench.

Strongest benchmark reference

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cas…

Reported benchmark with accuracy gives a fast comparison anchor.

Strongest recent paper

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimizatio…

Useful for current practice scanning; published Feb 26, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Mar 6, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Llm As Judge, Automatic Metrics · Benchmark: Lit Ragbench · Metric: Accuracy
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Feb 25, 2026 · Citations: 0 · Score: 6.0

HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Feb 26, 2026 · Citations: 0 · Score: 6.0

HF: Red Team · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Feb 14, 2026 · Citations: 0 · Score: 6.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Toxicity
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective
Feb 15, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: ARC Challenge · Metric: Accuracy
Rethinking Metrics for Lexical Semantic Change Detection
Feb 17, 2026 · Citations: 0 · Score: 4.5

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation Mar 6, 2026	No Not Reported	Llm As Judge , Automatic Metrics	Lit Ragbench	Accuracy	Not Reported
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models Feb 25, 2026	Yes Expert Verification	Automatic Metrics	Not Reported	Accuracy	Not Reported
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search Feb 26, 2026	Yes Red Team	Automatic Metrics	Not Reported	Accuracy , Conciseness	Not Reported
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages Feb 14, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Toxicity	Not Reported
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective Feb 15, 2026	No Not Reported	Automatic Metrics	ARC Challenge	Accuracy , Conciseness	Not Reported
Rethinking Metrics for Lexical Semantic Change Detection Feb 17, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Not Reported	Not Reported
Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek Feb 27, 2026	No Not Reported	Human Eval , Automatic Metrics	Not Reported	Bleu , Rouge	Not Reported
BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages Feb 28, 2026	No Not Reported	Automatic Metrics	Not Reported	F1	Not Reported
SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation Feb 23, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported
EnsembleLink: Accurate Record Linkage Without Training Data Jan 29, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	LIT-RAGBench: Benchmarking Generator Capabilities o…	MEDSYN: Benchmarking Multi-EviDence SYNthesis in Co…	Obscure but Effective: Classical Chinese Jailbreak…
Human Feedback	Not reported	Expert Verification	Red Team
Evaluation Modes	Llm As Judge, Automatic Metrics	Automatic Metrics	Automatic Metrics
Benchmarks	Lit Ragbench	Not reported	Not reported
Metrics	Accuracy	Accuracy	Accuracy, Conciseness
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Domain Experts	Unknown
Annotation Unit	Unknown	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (2)
Expert Verification (1)
Red Team (1)

Evaluation Modes

Automatic Metrics (10)
Human Eval (1)
Llm As Judge (1)

Top Benchmarks

ARC Challenge (1)
Lit Ragbench (1)

Top Metrics

Accuracy (6)
Conciseness (2)
Bertscore (1)
Bleu (1)

Rater Population Mix

Domain Experts (2)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 40.0% · benchmarks 20.0% · metrics 90.0% · quality controls 0.0%.

Top Papers

LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki · Mar 6, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics Long Horizon

To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic,…
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan · Feb 26, 2026 · Citations: 0

Red Team Automatic Metrics

Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs.
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality ''transfer'' across languages.
Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective
Ali Zahedzadeh, Behnam Bahrak · Feb 15, 2026 · Citations: 0

Automatic Metrics Long Horizon

Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers.To operationalize this view, we introduce an evaluation…
Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek
James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky · Feb 27, 2026 · Citations: 0

Human EvalAutomatic Metrics

This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose.
BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
Jason Lucas, Matt Murtagh-White, Adaku Uchendu, Ali Al-Lawati, Michiharu Yamashita · Feb 28, 2026 · Citations: 0

Automatic Metrics Multi Agent

We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) and LLM-generated…
SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation
Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026 · Citations: 0

Automatic Metrics Multi Agent

To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task.
EnsembleLink: Accurate Record Linkage Without Training Data
Noah Dasanaike · Jan 29, 2026 · Citations: 0

Automatic Metrics Tool Use

On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling.

Related Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote