- LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026 · Citations: 0
Human Eval General
We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
- More Human, More Efficient: Aligning Annotations with Quantized SLMs
Jiayu Wang, Junyoung Lee · Apr 1, 2026 · Citations: 0
Automatic Metrics General
As Large Language Model (LLM) capabilities advance, the demand for high-quality annotation of exponentially increasing text corpora has outpaced human capacity, leading to the widespread adoption of LLMs in automatic evaluation and…
- Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
Mingyang Song, Mao Zheng, Chenning Xu · Mar 11, 2026 · Citations: 0
Llm As Judge General
Through a large-scale study of 105,600 evaluation instances (32 LLMs \times 3 frontier judges \times 100 tasks \times 11 temperatures), we show that model-level agreement (Spearman ρ= 0.99) masks fragile sample-level agreement (Pearson r =…
- Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring
Jonas Kubesch, Lena Huber, Clemens Havas · Mar 6, 2026 · Citations: 0
Human Eval General
This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation.
- When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools
Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao · Mar 25, 2026 · Citations: 0
Automatic Metrics General
In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments.
- From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories
Daban Q. Jaff · Mar 30, 2026 · Citations: 0
Automatic Metrics General
After assembling model outputs, we introduce an agreement-based stability taxonomy (ABC) to stratify inter-model output stability.
- PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations
Vittoria Vineis, Matteo Silvestri, Lorenzo Antonelli, Filippo Betello, Gabriele Tolomei · Mar 6, 2026 · Citations: 0
Human Eval General
To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives.
- ReasonScaffold: A Scaffolded Reasoning-based Annotation Protocol for Human-AI Co-Annotation
Smitha Muthya Sudheendra, Jaideep Srivastava · Mar 22, 2026 · Citations: 0
Automatic Metrics General
We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior.
- Same Words, Different Judgments: Modality Effects on Preference Alignment
Aaron Broukhim, Nadir Weibel, Eshin Jolly · Feb 26, 2026 · Citations: 0
Automatic Metrics General
Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored.
- Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho · Apr 6, 2026 · Citations: 0
Human EvalAutomatic Metrics General
However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation.
- Modeling Grammatical Hypothesis Testing in Young Learners: A Sequence-Based Learning Analytics Study of Morphosyntactic Reasoning in an Interactive Game
Thierry Geoffre, Trystan Geoffre · Mar 2, 2026 · Citations: 0
General
Analyzing 597 gameplay sessions (9,783 actions) from 100 students aged 8-11 in authentic classroom settings, we introduce Hamming distance to quantify proximity to valid grammatical solutions and examine convergence patterns across…