
HFEPX Hub

CS.CL + Rubric Rating Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This hub page groups 16 papers. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Most common annotation unit: Multi-Dim Rubric. Most frequent quality control: Inter-Annotator Agreement Reported. Frequently cited benchmark: CapArena. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Feb 25, 2026.

Papers: 16 · Last published: Feb 25, 2026
cs.CL · Rubric Rating

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 16 papers for CS.CL + Rubric Rating. Dominant protocol signals include automatic metrics, human evaluation, and LLM-as-judge, with frequent benchmark focus on CapArena and LongBench and metric focus on accuracy and agreement. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • CapArena appears in 6.3% of hub papers (1/16); use this cohort for benchmark-matched comparisons.
  • LongBench appears in 6.3% of hub papers (1/16); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 18.8% of hub papers (3/16); compare with a secondary metric before ranking methods.
  • agreement is reported in 12.5% of hub papers (2/16); compare with a secondary metric before ranking methods.

Researcher Checklist

  • Maintain strength on Papers with explicit human feedback. Coverage is strong (100% vs 45% target).
  • Tighten coverage on Papers reporting quality controls. Coverage is usable but incomplete (18.8% vs 30% target).
  • Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (18.8% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (43.8% vs 35% target).
  • Maintain strength on Papers with known rater population. Coverage is strong (43.8% vs 35% target).
  • Maintain strength on Papers with known annotation unit. Coverage is strong (100% vs 35% target).
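The checklist labels above follow a percent-observed-versus-target pattern. A minimal sketch of how such labels could be derived (the function name and the 60%-of-target cutoff for "usable but incomplete" are assumptions chosen to reproduce the labels above, not the hub's actual rule):

```python
# Hypothetical sketch of the checklist's coverage labels. The 60%-of-target
# cutoff separating "usable but incomplete" from "a replication risk" is an
# illustrative assumption, not the HFEPX implementation.

def coverage_status(covered: int, total: int, target_pct: float) -> str:
    """Label observed coverage relative to a target percentage."""
    pct = 100.0 * covered / total
    if pct >= target_pct:
        label = "strong"
    elif pct >= 0.6 * target_pct:
        label = "usable but incomplete"
    else:
        label = "a replication risk"
    return f"Coverage is {label} ({pct:.1f}% vs {target_pct:.0f}% target)."

# Quality-control reporting: 3 of 16 papers against a 30% target.
print(coverage_status(3, 16, 30))
# -> Coverage is usable but incomplete (18.8% vs 30% target).
```

Note that the same 18.8% observed coverage lands in different buckets depending on the target: against the 35% benchmark-naming target it falls below the cutoff and reads as a replication risk.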


Suggested Reading Order

  1. Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

    A useful baseline for method and protocol context.

  3. SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing

    A useful baseline for method and protocol context.

  4. Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

    A useful baseline for method and protocol context.

  5. Discovering Implicit Large Language Model Alignment Objectives

    Include a human-eval paper to anchor calibration against automated judge settings.

  6. Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

    Include a human-eval paper to anchor calibration against automated judge settings.

  7. HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

    Include an LLM-as-judge paper to assess judge design and agreement assumptions.

  8. PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

    Include an LLM-as-judge paper to assess judge design and agreement assumptions.

Known Limitations

  • Only 18.8% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Benchmark coverage is thin (18.8% of papers mention benchmarks/datasets).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

human_eval vs llm_as_judge

both=2, left_only=3, right_only=0

2 papers use both Human Eval and LLM-as-Judge.

human_eval vs automatic_metrics

both=0, left_only=5, right_only=10

No papers use both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=2, right_only=10

No papers use both LLM-as-Judge and Automatic Metrics.
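The both/left_only/right_only counts above can be reproduced from per-paper evaluation-mode tags with a simple set check. A minimal sketch, assuming one tag set per paper (the example list is illustrative and matches only the human_eval vs llm_as_judge counts above, not the full 16-paper corpus):

```python
# Hypothetical sketch: reproduce the both/left_only/right_only overlap counts
# from per-paper evaluation-mode tags. Tag names and the papers list are
# illustrative assumptions, not the hub's stored data.

def overlap(papers: list[set[str]], left: str, right: str) -> dict[str, int]:
    """Count papers tagged with both modes, only the left, or only the right."""
    counts = {"both": 0, "left_only": 0, "right_only": 0}
    for tags in papers:
        if left in tags and right in tags:
            counts["both"] += 1
        elif left in tags:
            counts["left_only"] += 1
        elif right in tags:
            counts["right_only"] += 1
    return counts

papers = [
    {"human_eval", "llm_as_judge"},  # e.g. HEART
    {"human_eval", "llm_as_judge"},  # e.g. PoSh
    {"human_eval"},                  # human-eval-only papers
    {"human_eval"},
    {"human_eval"},
    {"automatic_metrics"},           # neither mode: not counted in this pair
]
print(overlap(papers, "human_eval", "llm_as_judge"))
# -> {'both': 2, 'left_only': 3, 'right_only': 0}
```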

Benchmark Brief

CapArena

Coverage: 1 paper (6.3%)

1 paper (6.3%) mentions CapArena.

Examples: PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Benchmark Brief

LongBench

Coverage: 1 paper (6.3%)

1 paper (6.3%) mentions LongBench.

Examples: Document Reconstruction Unlocks Scalable Long-Context RLVR

Benchmark Brief

MLE-Bench

Coverage: 1 paper (6.3%)

1 paper (6.3%) mentions MLE-Bench.

Examples: KLong: Training LLM Agent for Extremely Long-horizon Tasks

Metric Brief

agreement

Coverage: 2 papers (12.5%)

2 papers (12.5%) mention agreement.

Examples: HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue, A Scalable Framework for Evaluating Health Language Models
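The hub does not specify which agreement statistic each paper reports; one common chance-corrected choice for two raters over categorical labels is Cohen's kappa. A minimal sketch under that assumption (the function and example ratings are illustrative):

```python
# Minimal sketch of Cohen's kappa for two raters over categorical labels.
# Which agreement statistic each hub paper actually reports is not specified;
# kappa is shown only as a common chance-corrected choice.
from collections import Counter

def cohens_kappa(r1: list[str], r2: list[str]) -> float:
    """Chance-corrected agreement between two raters' label sequences."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Expected agreement under independent raters with the observed marginals.
    expected = sum(c1[label] / n * c2[label] / n for label in set(r1) | set(r2))
    return (observed - expected) / (1 - expected)

r1 = ["good", "good", "bad", "good", "bad", "bad"]
r2 = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(r1, r2), 3))
# -> 0.333
```

Raw percent agreement here is 4/6 ≈ 0.667, but correcting for the 0.5 chance-agreement baseline yields a much weaker kappa of 1/3, which is why hub pages flagging "agreement" as a metric signal benefit from chance-corrected reporting.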

Metric Brief

coherence

Coverage: 1 paper (6.3%)

1 paper (6.3%) mentions coherence.

Examples: Document Reconstruction Unlocks Scalable Long-Context RLVR

Top Papers

  • RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

    Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li · Feb 25, 2026 · Citations: 0

    Rubric Rating · Automatic Metrics

    Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.

  • SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing

    Yifei Xu, Guilherme Potje, Shivam Shandilya, Tiancheng Yuan, Leonardo de Oliveira Nunes · Feb 24, 2026 · Citations: 0

    Rubric Rating · Red Team · Automatic Metrics

    Designing aligned and robust rewards for open-ended generation remains a key barrier to RL post-training.

  • Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

    Jasmin Han, Janardan Devkota, Joseph Waring, Amanda Luken, Felix Naughton · Feb 23, 2026 · Citations: 0

    Rubric Rating · Automatic Metrics

    Perceived message effectiveness (PME) by potential intervention end-users is important for selecting and optimizing personalized smoking cessation intervention messages for mobile health (mHealth) platform delivery.

  • KLong: Training LLM Agent for Extremely Long-horizon Tasks

    Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi · Feb 19, 2026 · Citations: 0

    Rubric Rating · Automatic Metrics · Long Horizon

    This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks.

  • Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

    Kensuke Okada, Yui Furukawa, Kyosuke Bunji · Feb 19, 2026 · Citations: 0

    Rubric Rating · Automatic Metrics

    Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments.

  • Discovering Implicit Large Language Model Alignment Objectives

    Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026 · Citations: 0

    Rubric Rating · Human Eval

    To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.

  • Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

    Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

    Pairwise Preference · Rubric Rating · Human Eval · Multi Agent

    Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.

  • Small Reward Models via Backward Inference

    Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi · Feb 14, 2026 · Citations: 0

    Rubric Rating · Automatic Metrics

    However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility.

  • The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage

    Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh · Feb 10, 2026 · Citations: 0

    Pairwise Preference · Rubric Rating · Human Eval

    To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-c…

  • Document Reconstruction Unlocks Scalable Long-Context RLVR

    Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin · Feb 9, 2026 · Citations: 0

    Rubric Rating · Automatic Metrics

    However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.

  • APEX-Agents

    Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0

    Rubric Rating · Expert Verification · Simulation Env · Long Horizon

    We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate law…

  • Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

    Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0

    Rubric Rating · Critique Edit · Automatic Metrics

    Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.

  • HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

    Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong · Jan 9, 2026 · Citations: 0

    Pairwise Preference · Rubric Rating · Human Eval · LLM As Judge

    Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.

  • PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

    Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis · Oct 21, 2025 · Citations: 0

    Rubric Rating · Human Eval · LLM As Judge

    While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge.

  • Toward LLM-Supported Automated Assessment of Critical Thinking Subskills

    Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg · Oct 14, 2025 · Citations: 0

    Rubric Rating · Automatic Metrics

    As the world becomes increasingly saturated with AI-generated content, disinformation, and algorithmic persuasion, critical thinking - the capacity to evaluate evidence, detect unreliable claims, and exercise independent judgment - is becom…

  • A Scalable Framework for Evaluating Health Language Models

    Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025 · Citations: 0

    Rubric Rating · Expert Verification · Automatic Metrics

    As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.

Related Hubs