Designing and Evaluating Chain-of-Hints for Scientific Question Answering

Anubhav Jangra, Smaranda Muresan · Oct 24, 2025 · Citations: 0

Automatic Metrics · General · Pairwise Preference

Abstract

LLMs are reshaping education, with students increasingly relying on them for learning. Implemented using general-purpose models, these systems are likely to give away the answers, potentially undermining conceptual understanding and critical thinking. Prior work shows that hints can effectively promote cognitive engagement. Building on this insight, we evaluate 18 open-source LLMs on generating chains of hints that scaffold users toward the correct answer. We compare two distinct hinting strategies: static hints, pre-generated for each problem, and dynamic hints, adapted to a learner's progress. We evaluate these systems on five pedagogically grounded automatic metrics for hint quality. Using the best-performing LLM as the backbone of a quantitative study with 41 participants, we uncover distinct user preferences across hinting strategies and identify the limitations of automatic evaluation metrics in capturing them. Our findings highlight key design considerations for future research on tutoring systems and contribute toward the development of more learner-centered educational technologies.
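The difference between the two hinting strategies can be made concrete with a short sketch. This is an illustration only, not the authors' implementation: the `generate` callable, the function names, and the prompt wording are all hypothetical. It assumes only what the abstract states, that static hints are produced up front per problem while dynamic hints take the learner's progress into account.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to model text.
Generate = Callable[[str], str]


def static_hints(generate: Generate, question: str, n_hints: int = 3) -> List[str]:
    """Pre-generate the whole hint chain once; every learner sees the same hints."""
    hints: List[str] = []
    for i in range(n_hints):
        prompt = (
            f"Question: {question}\n"
            f"Hints so far: {'; '.join(hints) or 'none'}\n"
            f"Write hint {i + 1} without revealing the answer."
        )
        hints.append(generate(prompt))
    return hints


def dynamic_hint(
    generate: Generate,
    question: str,
    previous_hints: List[str],
    learner_attempt: str,
) -> str:
    """Generate the next hint on demand, conditioned on the learner's latest attempt."""
    prompt = (
        f"Question: {question}\n"
        f"Hints so far: {'; '.join(previous_hints) or 'none'}\n"
        f"Learner's latest attempt: {learner_attempt}\n"
        "Write the next hint, addressing the attempt without revealing the answer."
    )
    return generate(prompt)
```

In this sketch the only structural difference is the extra `learner_attempt` input: the static chain can be cached per problem, whereas the dynamic variant needs a model call at every step of the interaction.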

Human Data Lens

Uses human feedback: Yes
Feedback types: Pairwise Preference
Rater population: Unknown
Unit of annotation: Unknown
Expertise required: General

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: None
Quality controls: Not reported
Confidence: 0.65
Flags: None

Research Summary

Contribution Summary

LLMs are reshaping education, with students increasingly relying on them for learning.
Implemented using general-purpose models, these systems are likely to give away the answers, potentially undermining conceptual understanding and critical thinking.
Prior work shows that hints can effectively promote cognitive engagement.

Why It Matters For Eval

Using the best-performing LLM as the backbone of a quantitative study with 41 participants, we uncover distinct user preferences across hinting strategies and identify the limitations of automatic evaluation metrics in capturing them.