
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee · Jan 9, 2026 · Citations: 0

Data freshness

Check recency before relying on this page for active eval decisions. Use stale pages as context and verify against current hub results.

  • Extraction: Fresh (refreshed Apr 13, 2026, 6:35 AM)
  • Metadata: Stale (refreshed Feb 25, 2026, 1:13 AM)
  • Extraction source: Persisted extraction
  • Confidence: 0.70

Abstract

Supportive conversation depends on skills that go beyond language fluency, including reading emotions, adjusting tone, and navigating moments of resistance, frustration, or distress. Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans. We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations. For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators. All assessments follow a rubric grounded in interpersonal communication science across five dimensions: Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following. HEART uncovers striking behavioral patterns. Several frontier models approach or surpass the average human responses in perceived empathy and consistency. At the same time, humans maintain advantages in adaptive reframing, tension-naming, and nuanced tone shifts, particularly in adversarial turns. Human and LLM-as-judge preferences align on about 80 percent of pairwise comparisons, matching inter-human agreement, and their written rationales emphasize similar HEART dimensions. This pattern suggests an emerging convergence in the criteria used to assess supportive quality. By placing humans and models on equal footing, HEART reframes supportive dialogue as a distinct capability axis, separable from general reasoning or linguistic fluency. It provides a unified empirical foundation for understanding where model-generated support aligns with human social judgment, where it diverges, and how affective conversational competence scales with model size.
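
To make the protocol concrete, here is a minimal sketch of the pairwise evaluation unit the abstract describes: a shared dialogue history with one human and one model response, blinded rubric ratings across the five HEART dimensions, and an overall preference. All field names and types are illustrative assumptions, not the paper's released schema.

```python
# Illustrative sketch only: field names and types are assumptions,
# not the paper's released data schema.
from dataclasses import dataclass
from typing import Literal

# The five rubric dimensions named in the abstract.
HEART_DIMENSIONS = (
    "human_alignment",
    "empathic_responsiveness",
    "attunement",
    "resonance",
    "task_following",
)

@dataclass
class PairedComparison:
    """One blinded pairwise item: a shared dialogue history plus a human
    response and a model response, presented in randomized order."""
    dialogue_history: list[str]           # prior turns of the support conversation
    response_a: str                       # one candidate response (source blinded)
    response_b: str                       # the other candidate response
    rubric_scores_a: dict[str, int]       # per-dimension ratings for response A
    rubric_scores_b: dict[str, int]       # per-dimension ratings for response B
    preference: Literal["a", "b", "tie"]  # rater's overall pairwise choice
```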

HFEPX Relevance Assessment

This paper has strong direct human-feedback and evaluation protocol signal and is suitable as a primary eval pipeline reference.

Best use

Primary benchmark and eval reference

Use if you need

A primary eval reference for human-vs-LLM comparison protocols in emotional-support dialogue, pairing pairwise preferences with rubric ratings.

Main weakness

Quality controls and rater population are not reported in the extracted abstract.

Trust level

Moderate

Eval-Fit Score

79/100 • High

Use this as a primary source when designing or comparing eval protocols.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Extraction confidence: Moderate

Field Provenance & Confidence

Each key protocol field shows its extraction state, confidence band, and data source so you can decide whether to trust it directly or validate it against the full text.

Human Feedback Types

Strong

Pairwise Preference, Rubric Rating

Confidence: Moderate · Source: Persisted extraction (evidenced)

Directly usable for protocol triage.

Evidence snippet: For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators. All assessments follow a rubric grounded in interpersonal communication science across five dimensions.

Evaluation Modes

Strong

Human Eval, LLM-as-Judge

Confidence: Moderate · Source: Persisted extraction (evidenced)

Includes extracted eval setup.

Evidence snippet: For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators.

Quality Controls

Missing

Not reported

Confidence: Low · Source: Persisted extraction (missing)

No explicit QC controls found.

Evidence snippet: none (field not extracted).

Benchmarks / Datasets

Missing

Not extracted

Confidence: Low · Source: Persisted extraction (missing)

No benchmark anchors detected.

Evidence snippet: none (field not extracted).

Reported Metrics

Strong

Agreement

Confidence: Moderate · Source: Persisted extraction (evidenced)

Useful for evaluation criteria comparison.

Evidence snippet: Human and LLM-as-judge preferences align on about 80 percent of pairwise comparisons, matching inter-human agreement, and their written rationales emphasize similar HEART dimensions.
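
The agreement figure above can be reproduced, in spirit, with a simple percent-agreement computation over matched pairwise verdicts. This is a hedged sketch: the paper does not specify its exact statistic, and the function and inputs below are illustrative.

```python
# Hedged sketch: raw percent agreement between two raters' pairwise
# preferences. Illustrative only; the paper reports ~80 percent
# human/LLM-judge agreement but not the exact computation used.
def percent_agreement(prefs_x: list[str], prefs_y: list[str]) -> float:
    """Fraction of items on which both raters chose the same response."""
    assert len(prefs_x) == len(prefs_y), "raters must cover the same items"
    matches = sum(x == y for x, y in zip(prefs_x, prefs_y))
    return matches / len(prefs_x)

# Example: agreement on five pairwise comparisons.
print(percent_agreement(["a", "b", "a", "a", "b"],
                        ["a", "b", "b", "a", "b"]))  # 0.8
```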

Rater Population

Missing

Unknown

Confidence: Low · Source: Persisted extraction (missing)

Rater source not explicitly reported.

Evidence snippet: For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators.

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Pairwise Preference, Rubric Rating
  • Rater population: Unknown
  • Unit of annotation: Pairwise
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Human Eval, LLM-as-Judge (ensemble aggregation sketched below)
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.70
  • Flags: None
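
The abstract confirms an ensemble of LLM-as-judge evaluators but not how their verdicts are combined. A plausible minimal aggregation is majority voting, sketched below under that stated assumption.

```python
# Assumption: the judge ensemble is aggregated by majority vote. The
# abstract confirms an ensemble exists but not its aggregation rule.
from collections import Counter

def ensemble_preference(judge_votes: list[str]) -> str:
    """Majority preference ('a', 'b', or 'tie') across LLM judges."""
    counts = Counter(judge_votes)
    (top_label, top_count), *rest = counts.most_common()
    # If another label matches the top count, report an exact tie.
    if any(count == top_count for _, count in rest):
        return "tie"
    return top_label

print(ensemble_preference(["a", "a", "b"]))  # "a"
```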

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

Agreement

Research Brief

Deterministic synthesis

HEART directly compares humans and LLMs on the same multi-turn emotional-support conversations, addressing the lack of a clear way to measure how model interpersonal abilities compare to those of humans. HFEPX signals include Pairwise Preference, Rubric Rating, Human Eval, and LLM-as-Judge, with extraction confidence 0.70. Updated from the current HFEPX corpus.

Generated Apr 13, 2026, 6:35 AM · Grounded in abstract + metadata only

Key Takeaways

  • HEART is the first framework to directly compare humans and LLMs on the same multi-turn emotional-support conversations, via blinded human raters and an ensemble of LLM-as-judge evaluators.
  • Several frontier models approach or surpass average human responses in perceived empathy and consistency, while humans retain advantages in adaptive reframing, tension-naming, and nuanced tone shifts.
  • Human and LLM-as-judge preferences align on about 80 percent of pairwise comparisons, matching inter-human agreement.

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Identify benchmark choices from full text before operationalizing conclusions.
  • Validate metric comparability (agreement); a chance-corrected check is sketched below.
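
For the comparability check in the last action, a chance-corrected statistic such as Cohen's kappa is a reasonable companion to raw percent agreement. This is a sketch under that assumption; the paper reports agreement but does not name kappa.

```python
# Hedged sketch: Cohen's kappa as a chance-corrected complement to raw
# percent agreement. The paper does not state that kappa was used.
from collections import Counter

def cohens_kappa(prefs_x: list[str], prefs_y: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    n = len(prefs_x)
    observed = sum(x == y for x, y in zip(prefs_x, prefs_y)) / n
    cx, cy = Counter(prefs_x), Counter(prefs_y)
    labels = set(prefs_x) | set(prefs_y)
    expected = sum((cx[l] / n) * (cy[l] / n) for l in labels)
    if expected == 1.0:  # degenerate case: both raters always pick one label
        return 1.0
    return (observed - expected) / (1.0 - expected)
```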

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • Introduces HEART, the first framework to directly compare humans and LLMs on the same multi-turn emotional-support conversations.
  • Pairs human and model responses for each dialogue history and evaluates them through blinded human raters and an ensemble of LLM-as-judge evaluators.
  • Grounds all assessments in a five-dimension rubric from interpersonal communication science: Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following.

Why It Matters For Eval

  • By placing humans and models on equal footing, HEART reframes supportive dialogue as a distinct capability axis, separable from general reasoning or linguistic fluency.
  • Human and LLM-as-judge agreement of about 80 percent, matching inter-human agreement, suggests converging criteria for assessing supportive quality.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Pairwise Preference, Rubric Rating

  • Pass: Evaluation mode is explicit

    Detected: Human Eval, LLM-as-Judge

  • Gap: Quality control reporting is missing

    No calibration/adjudication/IAA control explicitly detected.

  • Gap: Benchmark or dataset anchors are missing

    No benchmark/dataset anchor extracted from abstract.

  • Pass: Metric reporting is present

    Detected: Agreement

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.
