HFEPX Hub

CS.HC Human Feedback And Eval Papers

Updated from current HFEPX corpus (Feb 27, 2026). 35 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 35 Last published: Feb 26, 2026 Global RSS

Cs.HC

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 35 papers for CS.HC Human Feedback And Eval Papers. Dominant protocol signals include automatic metrics, simulation environments, human evaluation, with frequent benchmark focus on Retrieval and metric focus on accuracy, cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

34.3% of papers report explicit human-feedback signals, led by expert verification.

Evidence: TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation , SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery , "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems , An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
automatic metrics appears in 80% of papers in this hub.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models , Exploring Human-Machine Coexistence in Symmetrical Reality , Evaluating the Usage of African-American Vernacular English in Large Language Models
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation , Dynamic Personality Adaptation in Large Language Models via State Machines , When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation , Dynamic Personality Adaptation in Large Language Models via State Machines , When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation , SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery , "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation , Dynamic Personality Adaptation in Large Language Models via State Machines , When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models

Benchmark Interpretation

Retrieval appears in 2.9% of hub papers (1/35); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 14.3% of hub papers (5/35); compare with a secondary metric before ranking methods.
cost is reported in 8.6% of hub papers (3/35); compare with a secondary metric before ranking methods.

Researcher Checklist

Tighten coverage on Papers with explicit human feedback. Coverage is usable but incomplete (34.3% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (2.9% vs 35% target).
Tighten coverage on Papers naming evaluation metrics. Coverage is usable but incomplete (31.4% vs 35% target).
Tighten coverage on Papers with known rater population. Coverage is usable but incomplete (25.7% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (11.4% vs 35% target).

Papers with explicit human feedback

Coverage is usable but incomplete (34.3% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).

Papers naming benchmarks/datasets

Coverage is a replication risk (2.9% vs 35% target).

Papers naming evaluation metrics

Coverage is usable but incomplete (31.4% vs 35% target).

Papers with known rater population

Coverage is usable but incomplete (25.7% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (11.4% vs 35% target).

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Annotation unit is under-specified (11.4% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=0, left_only=1, right_only=28

0 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=28, right_only=6

0 papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=6, right_only=1

0 papers use both Simulation Env and Human Eval.

Benchmark Brief

Retrieval

Coverage: 1 papers (2.9%)

1 papers (2.9%) mention Retrieval.

Examples: Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

Metric Brief

accuracy

Coverage: 5 papers (14.3%)

5 papers (14.3%) mention accuracy.

Examples: Modeling Distinct Human Interaction in Web Agents , What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data , How much does context affect the accuracy of AI health advice?

Metric Brief

cost

Coverage: 3 papers (8.6%)

3 papers (8.6%) mention cost.

Examples: When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models , SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery , A Scalable Framework for Evaluating Health Language Models

Metric Brief

agreement

Coverage: 2 papers (5.7%)

2 papers (5.7%) mention agreement.

Examples: Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters , A Scalable Framework for Evaluating Health Language Models

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation , Dynamic Personality Adaptation in Large Language Models via State Machines

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros · Feb 26, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources.
TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang · Feb 26, 2026 · Citations: 0

Expert Verification Simulation Env Multi Agent

As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety the quality of interaction patterns that unfold across conversations rather than the correctness
Dynamic Personality Adaptation in Large Language Models via State Machines
Leon Pielage, Ole Hätscher, Mitja Back, Bernhard Marschall, Benjamin Risse · Feb 25, 2026 · Citations: 0

Simulation Env

This work demonstrates the feasibility of modular, personality-adaptive architectures for education, customer support, and broader human-computer interaction.
When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large Language Models (LLMs) are increasingly used to ``professionalize'' workplace communication, often at the cost of linguistic identity.
Exploring Human-Machine Coexistence in Symmetrical Reality
Zhenliang Zhang · Feb 25, 2026 · Citations: 0

Automatic Metrics

In the context of the evolution of artificial intelligence (AI), the interaction between humans and AI entities has become increasingly salient, challenging the conventional human-centric paradigms of human-machine interaction.
Evaluating the Usage of African-American Vernacular English in Large Language Models
Deja Dunlap, R. Thomas McCoy · Feb 25, 2026 · Citations: 0

Automatic Metrics

In AI, most evaluations of natural language understanding tasks are conducted in standardized dialects such as Standard American English (SAE).
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics

Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics

Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practic
PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly Q&A
Anna Martin-Boyle, Cara A. C. Leckey, Martha C. Brown, Harmanpreet Kaur · Feb 24, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature.
Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore · Feb 23, 2026 · Citations: 0

Red Team Simulation Env

Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue.
PuppetChat: Fostering Intimate Communication through Bidirectional Actions and Micronarratives
Emma Jiren Wang, Siying Hu, Zhicong Lu · Feb 23, 2026 · Citations: 0

Automatic Metrics

As a primary channel for sustaining modern intimate relationships, instant messaging facilitates frequent connection across distances.
Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
Games That Teach, Chats That Convince: Comparing Interactive and Static Formats for Persuasive Learning
Seyed Hossein Alavi, Zining Wang, Shruthi Chockkalingam, Raymond T. Ng, Vered Shwartz · Feb 20, 2026 · Citations: 0

Automatic Metrics

Interactive systems such as chatbots and games are increasingly used to persuade and educate on sustainability-related topics, yet it remains unclear how different delivery formats shape learning and persuasive outcomes when content is held
Mind the Style: Impact of Communication Style on Human-Chatbot Interaction
Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver · Feb 19, 2026 · Citations: 0

Automatic Metrics Web Browsing

Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear.
Modeling Distinct Human Interaction in Web Agents
Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Web Browsing

Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold.
What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data
Dimitri Staufer, Kirsten Morehouse · Feb 19, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions.
Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers
Nusrat Jahan Lia, Shubhashis Roy Dipta · Feb 19, 2026 · Citations: 0

Automatic Metrics

The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior.
Surgical Activation Steering via Generative Causal Mediation
Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell · Feb 17, 2026 · Citations: 0

Automatic Metrics

Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response?
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Human Eval Multi Agent

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye · Oct 29, 2025 · Citations: 0

Automatic Metrics

Large language models (LLMs) are increasingly used as raters for evaluation tasks.
Designing and Evaluating Chain-of-Hints for Scientific Question Answering
Anubhav Jangra, Smaranda Muresan · Oct 24, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Using the best performing LLM as the backbone of a quantitative study with 41 participants, we uncover distinct user preferences across hinting strategies, and identify the limitations of automatic evaluation metrics to capture them.
Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
Soorya Ram Shimgekar, Ruining Zhao, Agam Goyal, Violeta J. Rodriguez, Paul A. Bloom · Oct 16, 2025 · Citations: 0

Simulation Env

On social media, several individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly.
Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
Shramay Palta, Peter Rankel, Sarah Wiegreffe, Rachel Rudinger · Oct 9, 2025 · Citations: 0

Automatic Metrics

We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by L
ClearFairy: Capturing Creative Workflows through Decision Structuring, In-Situ Questioning, and Rationale Inference
Kihoon Son, DaEun Choi, Tae Soo Kim, Young-Ho Kim, Sangdoo Yun · Sep 18, 2025 · Citations: 0

Critique Edit Automatic Metrics

Furthermore, exploratory applications demonstrate that captured steps can enhance generative AI agents in Figma, yielding predictions better aligned with professionals and producing coherent outcomes.
The AI Memory Gap: Users Misremember What They Created With AI or Without
Tim Zindulka, Sven Goller, Daniela Fernandes, Robin Welsch, Daniel Buschek · Sep 15, 2025 · Citations: 0

Automatic Metrics

Our findings reveal a significant gap in memory: After AI use, the odds of correct attribution dropped, with the steepest decline in mixed human-AI workflows, where either the idea or elaboration was created with AI.
Collaborative Document Editing with Multiple Users and AI Agents
Florian Lehmann, Krystsina Shauchenka, Daniel Buschek · Sep 15, 2025 · Citations: 0

Simulation Env Multi Agent

We propose integrating AI agents directly into collaborative writing environments.
When Algorithms Meet Artists: Semantic Compression of Artists' Concerns in the Public AI-Art Debate
Ariya Mukherjee-Gandhi, Oliver Muellerklein · Aug 5, 2025 · Citations: 0

Automatic Metrics

Artists occupy a paradoxical position in generative AI: their work trains the models reshaping creative labor.
Sensory-Motor Control with Large Language Models via Iterative Policy Refinement
Jônata Tyska Carvalho, Stefano Nolfi · Jun 5, 2025 · Citations: 0

Simulation Env

We propose a method that enables large language models (LLMs) to control embodied agents through the generation of control policies that directly map continuous observation vectors to continuous action vectors.
Toward Beginner-Friendly LLMs for Language Learning: Controlling Difficulty in Conversation
Meiqing Jin, Liam Dugan, Chris Callison-Burch · Jun 4, 2025 · Citations: 0

Automatic Metrics

We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments.
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
How much does context affect the accuracy of AI health advice?
Prashant Garg, Thiemo Fetzer · Apr 25, 2025 · Citations: 0

Automatic Metrics

English-language performance does not reliably generalise across contexts, underscoring the need for multilingual, domain-specific evaluation before deployment in public-health communication.
A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025 · Citations: 0

Rubric RatingExpert Verification Automatic Metrics

As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen · Mar 3, 2025 · Citations: 0

Automatic Metrics

A response mixed of factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on.
Usability Study of Security Features in Programmable Logic Controllers
Karen Li, Kopo M. Ramokapane, Awais Rashid · Aug 4, 2022 · Citations: 0

Automatic Metrics Web Browsing

Our results uncover various misperceptions about the security controls and how design constraints, e.g., safety and lack of regular updates due to the long-term nature of such systems, provide significant challenges to the realization of mo

CS.HC Human Feedback And Eval Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers

Related Hubs