A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist

Sohyeon Jeon, Hyung-Chul Lee · Oct 22, 2025 · Citations: 0

Abstract

Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge. In particular, uncertainty calibration and metacognitive reliability of LLM reasoning are poorly understood and underexplored in medical automation. This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset, systematically comparing two representative LLMs - one general and one domain-specialized - across three prompt strategies. We analyze both cognitive adaptation and calibration error using metrics: Expected Calibration Error (ECE) and a baseline-normalized Relative Calibration Error (RCE) that enables reliable cross-model comparison. Our results reveal pronounced miscalibration and overconfidence in both models, especially under clinical role-playing conditions, with calibration error persisting above clinically relevant thresholds. These findings underscore the need for improved calibration, transparent code, and strategic prompt engineering to develop reliable and explainable medical AI.

Human Data Lens

Uses human feedback: No
Feedback types: None
Rater population: Domain Experts
Unit of annotation: Unknown
Expertise required: Medicine, Coding

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: None
Quality controls: Calibration
Confidence: 0.45
Flags: low_signal, possible_false_positive

Research Summary

Contribution Summary

Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge.
In particular, uncertainty calibration and metacognitive reliability of LLM reasoning are poorly understood and underexplored in medical automation.
This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset, systematically comparing two representative LLMs - one general and one domain-specialized - across three prompt strategies.

Why It Matters For Eval

Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge.