HFEPX Hub

Multilingual Papers (Last 60 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 13 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Mar 1, 2026). 13 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequently cited benchmark: ARC-Challenge. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 14, 2026.

Papers: 13 Last published: Feb 14, 2026 Global RSS Tag RSS

MultilingualLast 60d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (13) Replication-Ready Only (1)

High-Signal Coverage

100.0%

13 / 13 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

1 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters (Expanded)

Why This Matters For Eval Research

84.6% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 46.2% of papers in this hub.
ARC-Challenge is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Notes (Expanded)

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Track metric sensitivity by reporting both accuracy and conciseness.

Benchmark Interpretation

ARC-Challenge appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 30.8% of hub papers (4/13); compare with a secondary metric before ranking methods.
conciseness is reported in 15.4% of hub papers (2/13); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Strong: Papers with explicit human feedback

Coverage is strong (84.6% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (7.7% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (46.2% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (15.4% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (23.1% vs 35% target).

Strengths

Strong human-feedback signal (84.6% of papers).
Agentic evaluation appears in 30.8% of papers.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15.4% coverage).
Annotation unit is under-specified (23.1% coverage).

Suggested Next Analyses

Track metric sensitivity by reporting both accuracy and conciseness.

Recommended Queries (Expanded)

Recommended Queries

Benchmark Slice: ARC-Challenge Metric Slice: accuracy Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe
Feb 14, 2026 · Citations: 0 · Score: 6.0

HF: Pairwise Preference · Eval: Not reported · Benchmark: Not Reported · Metric: Precision
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Feb 25, 2026 · Citations: 0 · Score: 6.0

HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Feb 26, 2026 · Citations: 0 · Score: 6.0

HF: Red Team · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Feb 14, 2026 · Citations: 0 · Score: 6.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Toxicity
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective
Feb 15, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: ARC Challenge · Metric: Accuracy
Unlocking Reasoning Capability on Machine Translation in Large Language Models
Feb 16, 2026 · Citations: 0 · Score: 4.5

HF: Critique Edit · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe Feb 14, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Precision	Not Reported
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models Feb 25, 2026	Yes Expert Verification	Automatic Metrics	Not Reported	Accuracy	Not Reported
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search Feb 26, 2026	Yes Red Team	Automatic Metrics	Not Reported	Accuracy , Conciseness	Not Reported
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages Feb 14, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Toxicity	Not Reported
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective Feb 15, 2026	No Not Reported	Automatic Metrics	ARC Challenge	Accuracy , Conciseness	Not Reported
Unlocking Reasoning Capability on Machine Translation in Large Language Models Feb 16, 2026	Yes Critique Edit	Not Reported	Not Reported	Not Reported	Not Reported
Rethinking Metrics for Lexical Semantic Change Detection Feb 17, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Not Reported	Not Reported
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment Feb 18, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection Feb 25, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages Feb 18, 2026	Yes Red Team	Not Reported	Not Reported	Not Reported	Not Reported
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents Feb 18, 2026	Yes Red Team	Not Reported	Not Reported	Not Reported	Not Reported
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding Jan 13, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Tutoring Large Language Models to be Domain-adaptiv…	MEDSYN: Benchmarking Multi-EviDence SYNthesis in Co…	Obscure but Effective: Classical Chinese Jailbreak…
Human Feedback	Pairwise Preference	Expert Verification	Red Team
Evaluation Modes	Not reported	Automatic Metrics	Automatic Metrics
Benchmarks	Not reported	Not reported	Not reported
Metrics	Precision	Accuracy	Accuracy, Conciseness
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Domain Experts	Unknown
Annotation Unit	Trajectory	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (6)
Red Team (3)
Critique Edit (1)
Expert Verification (1)

Evaluation Modes

Automatic Metrics (6)

Top Benchmarks

ARC Challenge (1)

Top Metrics

Accuracy (4)
Conciseness (2)
Precision (1)
Toxicity (1)

Rater Population Mix

Domain Experts (2)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 84.6% · benchmarks 7.7% · metrics 46.2% · quality controls 0.0%.

Top Papers

Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe
Somnath Banerjee · Feb 14, 2026 · Citations: 0

Pairwise Preference Long Horizon

The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan · Feb 26, 2026 · Citations: 0

Red Team Automatic Metrics

Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs.
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality ''transfer'' across languages.
Unlocking Reasoning Capability on Machine Translation in Large Language Models
Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio · Feb 16, 2026 · Citations: 0

Critique Edit Long Horizon

We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective
Ali Zahedzadeh, Behnam Bahrak · Feb 15, 2026 · Citations: 0

Automatic Metrics Long Horizon

Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers.To operationalize this view, we introduce an evaluation…
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026 · Citations: 0

Pairwise Preference

The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit · Jan 13, 2026 · Citations: 0

Pairwise Preference

The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation
Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026 · Citations: 0

Automatic Metrics Multi Agent

To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task.
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection
Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li · Feb 25, 2026 · Citations: 0

Pairwise Preference

Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged…
IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages
Priyaranjan Pattnayak, Sanchari Chowdhuri · Feb 18, 2026 · Citations: 0

Red Team

Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied.
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026 · Citations: 0

Red Team

LLM-based agents execute real-world workflows via tools and memory.

Related Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote