- CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng · Mar 31, 2026 · Citations: 0
Rubric RatingExpert Verification Human Eval Web Browsing
The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
- LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026 · Citations: 0
Rubric Rating Human Eval
We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
- Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Human EvalAutomatic Metrics
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
- More Human, More Efficient: Aligning Annotations with Quantized SLMs
Jiayu Wang, Junyoung Lee · Apr 1, 2026 · Citations: 0
Rubric Rating Automatic Metrics
As Large Language Model (LLM) capabilities advance, the demand for high-quality annotation of exponentially increasing text corpora has outpaced human capacity, leading to the widespread adoption of LLMs in automatic evaluation and…
- PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu · Mar 29, 2026 · Citations: 0
Rubric RatingExpert Verification Automatic MetricsSimulation Env
We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
- Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching
Yicheng Pan, Zhiyuan Ning, Ludi Wang, Yi Du · Apr 7, 2026 · Citations: 0
Rubric Rating Automatic Metrics
To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching.
- Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Llm As Judge
We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
- Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki · Apr 1, 2026 · Citations: 0
Rubric Rating Automatic Metrics
We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal…
- Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
Xinran Zhang · Mar 30, 2026 · Citations: 0
Rubric Rating Automatic Metrics
Atomic decomposition -- breaking a candidate answer into claims before verifying each against a reference -- is a widely adopted design for LLM-based reference-grounded judges.
- FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed · Apr 7, 2026 · Citations: 0
Rubric Rating Long Horizon
To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete.
- QED-Nano: Teaching a Tiny Model to Prove Hard Theorems
LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching · Apr 6, 2026 · Citations: 0
Rubric Rating Automatic Metrics
To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
- Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement
Wataru Hirota, Tomoki Taniguchi, Tomoko Ohkuma, Kosuke Takahashi, Takahiro Omi · Apr 24, 2026 · Citations: 0
Rubric RatingExpert Verification
Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree.
- MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin · Mar 30, 2026 · Citations: 0
Rubric Rating
Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs.
- Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation
Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang · Apr 1, 2026 · Citations: 0
Rubric RatingCritique Edit
However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to…
- Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri · Apr 23, 2026 · Citations: 0
Rubric Rating
To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation.
- Training data generation for context-dependent rubric-based short answer grading
Pavel Šindelář, Dávid Slivka, Christopher Bouma, Filip Prášil, Ondřej Bojar · Mar 30, 2026 · Citations: 0
Rubric Rating
However, having to avoid language differences and annotator bias makes the grading of student answers challenging.