HFEPX Hub

Automatic Metrics + General + Rubric Rating Papers

Updated from current HFEPX corpus (Apr 27, 2026). 16 papers are grouped in this hub page.


Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: Interaction2eval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Apr 8, 2026.

Papers: 16 · Last published: Apr 8, 2026

Automatic Metrics · General · Rubric Rating

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Medium.


High-Signal Coverage

100.0%

16 / 16 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

5 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
3 papers report explicit quality controls (calibration/adjudication/IAA).
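
The three counts above are simple filters over per-paper records. A minimal sketch, assuming a hypothetical record layout (the field names `eval_modes`, `benchmark`, `metrics`, and `qc` are illustrative and are not the HFEPX schema). The rows are transcribed from the first six rows of the Protocol Matrix on this page, so the counts describe that subset only; the hub-wide quality-control count of 3 needs all 16 papers.

```python
# Hypothetical record layout (not the HFEPX schema). Values are transcribed
# from the first six rows of the Protocol Matrix on this page.
PAPERS = [
    {"title": "Personalized RewardBench",
     "eval_modes": {"human_eval", "automatic_metrics"},
     "benchmark": "Rewardbench", "metrics": ["accuracy", "helpfulness"], "qc": []},
    {"title": "Beyond Paper-to-Paper",
     "eval_modes": {"automatic_metrics"},
     "benchmark": "Scirepeval", "metrics": ["recall"], "qc": []},
    {"title": "When AI Meets Early Childhood Education",
     "eval_modes": {"automatic_metrics"},
     "benchmark": "Interaction2eval", "metrics": ["agreement"], "qc": []},
    {"title": "Rethinking Atomic Decomposition for LLM Judges",
     "eval_modes": {"automatic_metrics"},
     "benchmark": "TruthfulQA", "metrics": ["accuracy"], "qc": []},
    {"title": "Stabilizing Rubric Integration Training",
     "eval_modes": {"automatic_metrics"},
     "benchmark": "Olympiadbench", "metrics": ["accuracy"], "qc": []},
    {"title": "More Human, More Efficient",
     "eval_modes": {"automatic_metrics"},
     "benchmark": None, "metrics": ["agreement"],
     "qc": ["iaa_reported", "adjudication"]},
]


def is_replication_ready(paper):
    """Benchmark + metric + evaluation mode are all explicitly present."""
    return bool(paper["benchmark"]) and bool(paper["metrics"]) and bool(paper["eval_modes"])


def supports_judge_vs_human(paper):
    """Paper contains both `human_eval` and `llm_as_judge`."""
    return {"human_eval", "llm_as_judge"} <= paper["eval_modes"]


def reports_quality_controls(paper):
    """At least one quality control (calibration/adjudication/IAA) is listed."""
    return bool(paper["qc"])


replication_ready = [p["title"] for p in PAPERS if is_replication_ready(p)]
judge_vs_human = [p["title"] for p in PAPERS if supports_judge_vs_human(p)]
with_qc = [p["title"] for p in PAPERS if reports_quality_controls(p)]

print(len(replication_ready), len(judge_vs_human), len(with_qc))  # prints: 5 0 1
```

The subset test (`<=` on sets) keeps the judge/human filter strict: a paper with only one of the two modes does not qualify.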

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.


Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by rubric ratings.
Automatic metrics appear in 100% of papers in this hub.
Interaction2eval is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Most common quality-control signal is inter-annotator agreement reporting (18.8% of papers).
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Benchmark Interpretation

Interaction2eval appears in 6.3% of hub papers (1/16); use this cohort for benchmark-matched comparisons.
Olympiadbench appears in 6.3% of hub papers (1/16); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 56.3% of hub papers (9/16); compare with a secondary metric before ranking methods.
Agreement is reported in 18.8% of hub papers (3/16); compare with a secondary metric before ranking methods.
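
The percentages on this page follow directly from the paper counts. A minimal sketch, assuming half-up rounding to one decimal place: the page does not state its rounding rule, but half-up reproduces the figures shown (1/16 as 6.3%, 9/16 as 56.3%, 3/16 as 18.8%). The helper name `share_pct` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def share_pct(count: int, total: int) -> str:
    """Share of `count` in `total` as a percentage string with one decimal.

    Assumes half-up rounding. Python's built-in round() uses half-even,
    so round(6.25, 1) would give 6.2 rather than the 6.3 shown on this page.
    """
    pct = Decimal(count * 100) / Decimal(total)
    return str(pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))


for label, count in [("benchmark cohort", 1), ("accuracy", 9), ("agreement", 3)]:
    print(f"{label}: {share_pct(count, 16)}% ({count}/16)")
```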

Researcher Checklist

Strong: Papers with explicit human feedback
Coverage is strong (100% vs 45% target).

Moderate: Papers reporting quality controls
Coverage is usable but incomplete (18.8% vs 30% target).

Moderate: Papers naming benchmarks/datasets
Coverage is usable but incomplete (31.3% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (93.8% vs 35% target).

Strong: Papers with known rater population
Coverage is strong (37.5% vs 35% target).

Strong: Papers with known annotation unit
Coverage is strong (100% vs 35% target).
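
The checklist labels can be read as a comparison of coverage against a per-item target. A minimal sketch, assuming the simplest rule consistent with the six rows above: coverage at or above target is "Strong", anything below is "Moderate". This rule is an assumption; the page does not state its thresholds, and a lower band may exist that none of these rows triggers.

```python
# (item, coverage %, target %, label shown on this page)
CHECKLIST = [
    ("explicit human feedback", 100.0, 45.0, "Strong"),
    ("quality controls", 18.8, 30.0, "Moderate"),
    ("benchmarks/datasets", 31.3, 35.0, "Moderate"),
    ("evaluation metrics", 93.8, 35.0, "Strong"),
    ("known rater population", 37.5, 35.0, "Strong"),
    ("known annotation unit", 100.0, 35.0, "Strong"),
]


def band(coverage: float, target: float) -> str:
    """Assumed rule: meeting the target is 'Strong', falling short is 'Moderate'."""
    return "Strong" if coverage >= target else "Moderate"


for item, coverage, target, shown in CHECKLIST:
    status = "matches" if band(coverage, target) == shown else "differs from"
    print(f"{item}: {band(coverage, target)} ({status} the page)")
```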

Strengths

Strong human-feedback signal (100% of papers).
Rater population and annotation-unit details are frequently specified.

Known Gaps

Only 18.8% of papers report quality controls; prioritize calibration/adjudication evidence.

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (Interaction2eval vs Olympiadbench) before comparing methods.
Track metric sensitivity by reporting both accuracy and agreement.
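
To illustrate why reporting both accuracy and agreement matters, here is a minimal sketch that computes raw accuracy and Cohen's kappa (a chance-corrected agreement statistic) on the same pair of label lists. The labels are hypothetical toy data, not results from any paper in this hub, and the function names are illustrative.

```python
import math
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of items where the two label lists agree."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohen_kappa(reference, predicted):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label lists."""
    n = len(reference)
    p_o = accuracy(reference, predicted)
    ref_counts, pred_counts = Counter(reference), Counter(predicted)
    # Expected agreement if both raters labeled independently at their own base rates.
    p_e = sum((ref_counts[label] / n) * (pred_counts[label] / n)
              for label in set(reference) | set(predicted))
    if p_e == 1.0:
        return math.nan  # undefined when both raters use a single identical label
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy labels: the model always predicts the majority class.
human = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
model = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(f"accuracy={accuracy(human, model):.2f}, kappa={cohen_kappa(human, model):.2f}")
# prints: accuracy=0.80, kappa=0.00
```

In this toy case accuracy looks reasonable while kappa shows no agreement beyond chance, which is the kind of gap a single-metric ranking would hide.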

Recommended Queries

Human Eval Protocols
Benchmark Slice: Interaction2eval
Metric Slice: accuracy
IAA-Reported Evaluations
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

Personalized RewardBench: Evaluating Reward Models with Human Aligned…

Highest protocol score with explicit human/eval signal plus Rewardbench.

Strongest benchmark reference

Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Pa…

Scirepeval with recall gives a fast comparison anchor.

Strongest recent paper

When AI Meets Early Childhood Education: Large Language Models as Ass…

Useful for current practice scanning; published Mar 25, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Apr 8, 2026 · Citations: 0 · Score: 7.5
HF: Pairwise Preference, Rubric Rating · Eval: Human Eval, Automatic Metrics · Benchmark: Rewardbench · Metric: Accuracy

Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching
Apr 7, 2026 · Citations: 0 · Score: 7.5
HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Scirepeval · Metric: Recall

When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools
Mar 25, 2026 · Citations: 0 · Score: 7.5
HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Interaction2eval · Metric: Agreement

Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
Mar 30, 2026 · Citations: 0 · Score: 7.5
HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: TruthfulQA · Metric: Accuracy

Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
Mar 27, 2026 · Citations: 0 · Score: 7.5
HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Olympiadbench · Metric: Accuracy

More Human, More Efficient: Aligning Annotations with Quantized SLMs
Apr 1, 2026 · Citations: 0 · Score: 7.0
HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Agreement

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Date	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization	Apr 8, 2026	Yes: Pairwise Preference, Rubric Rating	Human Eval, Automatic Metrics	Rewardbench	Accuracy, Helpfulness	Not Reported
Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching	Apr 7, 2026	Yes: Rubric Rating	Automatic Metrics	Scirepeval	Recall	Not Reported
When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools	Mar 25, 2026	Yes: Rubric Rating	Automatic Metrics	Interaction2eval	Agreement	Not Reported
Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation	Mar 30, 2026	Yes: Rubric Rating	Automatic Metrics	TruthfulQA	Accuracy	Not Reported
Stabilizing Rubric Integration Training via Decoupled Advantage Normalization	Mar 27, 2026	Yes: Rubric Rating	Automatic Metrics	Olympiadbench	Accuracy	Not Reported
More Human, More Efficient: Aligning Annotations with Quantized SLMs	Apr 1, 2026	Yes: Rubric Rating	Automatic Metrics	Not Reported	Agreement	Inter Annotator Agreement Reported, Adjudication
Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins	Feb 23, 2026	Yes: Rubric Rating	Automatic Metrics	Not Reported	Accuracy, F1	Inter Annotator Agreement Reported
From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text	Jan 6, 2026	Yes: Rubric Rating	Automatic Metrics	Not Reported	Accuracy, Agreement	Calibration, Inter Annotator Agreement Reported
CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading	Mar 12, 2026	Yes: Rubric Rating	Automatic Metrics	Not Reported	Accuracy	Not Reported
Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models	Mar 16, 2026	Yes: Rubric Rating	Automatic Metrics	Not Reported	Relevance	Not Reported
RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning	Feb 25, 2026	Yes: Rubric Rating	Automatic Metrics	Not Reported	Accuracy	Not Reported
Query-focused and Memory-aware Reranker for Long Context Processing	Feb 12, 2026	Yes: Rubric Rating	Automatic Metrics	Not Reported	Accuracy, Relevance	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Personalized RewardBench: Evaluating Reward Models…	Beyond Paper-to-Paper: Structured Profiling and Rub…	When AI Meets Early Childhood Education: Large Lang…
Human Feedback	Pairwise Preference, Rubric Rating	Rubric Rating	Rubric Rating
Evaluation Modes	Human Eval, Automatic Metrics	Automatic Metrics	Automatic Metrics
Benchmarks	Rewardbench	Scirepeval	Interaction2eval
Metrics	Accuracy, Helpfulness	Recall	Agreement
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Domain Experts	Domain Experts
Annotation Unit	Pairwise	Multi Dim Rubric	Multi Dim Rubric

Research Utility Snapshot

Human Feedback Mix

Rubric Rating (16)
Pairwise Preference (2)
Critique Edit (1)

Evaluation Modes

Automatic Metrics (16)
Human Eval (1)

Top Benchmarks

Interaction2eval (1)
Olympiadbench (1)
Rewardbench (1)
Scirepeval (1)

Top Metrics

Accuracy (9)
Agreement (3)
Relevance (3)
Cost (2)

Rater Population Mix

Domain Experts (6)

Quality Controls

Inter Annotator Agreement Reported (3)
Adjudication (1)
Calibration (1)

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 31.3% · metrics 93.8% · quality controls 18.8%.

Top Papers

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0
Pairwise Preference · Rubric Rating · Human Eval · Automatic Metrics
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.

More Human, More Efficient: Aligning Annotations with Quantized SLMs
Jiayu Wang, Junyoung Lee · Apr 1, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
As Large Language Model (LLM) capabilities advance, the demand for high-quality annotation of exponentially increasing text corpora has outpaced human capacity, leading to the widespread adoption of LLMs in automatic evaluation and…

From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text
Shinwoo Park, Yo-Sub Han · Jan 6, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for trained readers, who can over-trust surface well-formedness.

Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching
Yicheng Pan, Zhiyuan Ning, Ludi Wang, Yi Du · Apr 7, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching.

When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools
Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao · Mar 25, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments.

Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins
Jasmin Han, Janardan Devkota, Joseph Waring, Amanda Luken, Felix Naughton · Feb 23, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
Model performance was assessed on three held-out messages per participant using accuracy, Cohen's kappa, and F1.

Role-Augmented Intent-Driven Generative Search Engine Optimization
Xiaolu Chen, Haojie Wu, Jie Bao, Zhen Chen, Yong Liao · Aug 15, 2025 · Citations: 0
Rubric Rating · Automatic Metrics · Web Browsing
To better evaluate the method under realistic settings, we address the benchmarking limitations of prior work by: (1) extending the GEO dataset with diversified query variations reflecting real-world search scenarios and (2) introducing…

Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
Xinran Zhang · Mar 30, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
Atomic decomposition -- breaking a candidate answer into claims before verifying each against a reference -- is a widely adopted design for LLM-based reference-grounded judges.

Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng · Mar 27, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward…

CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading
Pranav Raikote, Korbinian Randl, Ioanna Miliou, Athanasios Lakes, Panagiotis Papapetrou · Mar 12, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow.

PrefDisco: Benchmarking Proactive Personalized Reasoning
Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh · Sep 30, 2025 · Citations: 0
Pairwise Preference · Rubric Rating · Automatic Metrics
We introduce PrefDisco, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse, context-dependent preferences, and define PrefAlign as a…

Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models
Han Zhang, Jiamin Su, Li liu · Mar 16, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous.

RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li · Feb 25, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.

Query-focused and Memory-aware Reranker for Long Context Processing
Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin · Feb 12, 2026 · Citations: 0
Rubric Rating · Automatic Metrics
It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage.

Distilling Feedback into Memory-as-a-Tool
Víctor Gallego · Jan 9, 2026 · Citations: 0
Rubric Rating · Critique Edit · Automatic Metrics
We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls.

Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong · Sep 25, 2025 · Citations: 0
Rubric Rating · Automatic Metrics
Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs.
