- Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins
Jasmin Han, Janardan Devkota, Joseph Waring, Amanda Luken, Felix Naughton · Feb 23, 2026 · Citations: 0
Rubric Rating Automatic Metrics
Model performance was assessed on three held-out messages per participant using accuracy, Cohen's kappa, and F1.
- Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Multi Agent
Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
- Discovering Implicit Large Language Model Alignment Objectives
Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026 · Citations: 0
Rubric Rating Human Eval
To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
- Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric
Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao · Feb 15, 2026 · Citations: 0
Pairwise PreferenceRubric Rating Llm As Judge
To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which…
- RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li · Feb 25, 2026 · Citations: 0
Rubric Rating Automatic Metrics
Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
- Optimizing In-Context Demonstrations for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek · Feb 28, 2026 · Citations: 0
Rubric RatingDemonstrations
GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
- SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
Yifei Xu, Guilherme Potje, Shivam Shandilya, Tiancheng Yuan, Leonardo de Oliveira Nunes · Feb 24, 2026 · Citations: 0
Rubric RatingRed Team
We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items.
- LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering
Rafid Ishrak Jahan, Fahmid Shahriar Iqbal, Sagnik Ray Choudhury · Feb 27, 2026 · Citations: 0
Pairwise PreferenceRubric Rating
We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA.
- Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study
Kensuke Okada, Yui Furukawa, Kyosuke Bunji · Feb 19, 2026 · Citations: 0
Rubric Rating
Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments.
- The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents
Ziyang Ma, Ruiyang Xu, Yinghao Ma, Chao-Han Huck Yang, Bohan Li · Feb 15, 2026 · Citations: 0
Rubric Rating
Featured Single Model and Agent tracks, the competition attracting 156 teams from 18 countries and regions.