HFEPX Hub

CS.HC + Automatic Metrics Papers

Updated from current HFEPX corpus (Feb 27, 2026). This hub page groups 28 papers. Common evaluation mode: automatic metrics. Most common rater population: domain experts. Most common annotation unit: multi-dimensional rubric. Only named benchmark: Retrieval (1 of 28 papers). Most common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 28 · Last published: Feb 26, 2026

CS.HC · Automatic Metrics

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 28 papers tagged CS.HC + Automatic Metrics. The dominant protocol signal is automatic metrics; Retrieval is the only named benchmark (1 of 28 papers), and the most reported metrics are accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

32.1% of papers report explicit human-feedback signals, led by expert verification.

Evidence: SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery; "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems; An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems; LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Automatic metrics appear in 100% of papers in this hub.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks; When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models; Exploring Human-Machine Coexistence in Symmetrical Reality; Evaluating the Usage of African-American Vernacular English in Large Language Models

Retrieval is the only benchmark named on this page, appearing in 1 of 28 papers.

Evidence: Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

Protocol Takeaways

Quality-control reporting is absent in this slice (0% of papers); prioritize any paper that documents explicit calibration or adjudication steps.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks; When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models; Exploring Human-Machine Coexistence in Symmetrical Reality; Evaluating the Usage of African-American Vernacular English in Large Language Models

Where the rater population is known, it is mostly domain experts, and the most common annotation unit is a multi-dimensional rubric; use this to scope replication staffing.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks; SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery; "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems; An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Track metric sensitivity by reporting both accuracy and cost.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks; When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models; Exploring Human-Machine Coexistence in Symmetrical Reality; Evaluating the Usage of African-American Vernacular English in Large Language Models

Benchmark Interpretation

Retrieval appears in 3.6% of hub papers (1/28); use this cohort for benchmark-matched comparisons.
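The coverage figures on this page are plain shares of the 28-paper hub. As a minimal sketch (the helper name `coverage` is hypothetical; the counts are the ones reported on this page), the percentages can be re-derived like this:

```python
def coverage(count: int, total: int) -> str:
    """Format a paper count as a percentage share of the hub, to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{100 * count / total:.1f}%"


TOTAL_PAPERS = 28  # hub size reported on this page

# Paper counts as reported on this page for each benchmark or metric.
counts = {"Retrieval": 1, "accuracy": 5, "cost": 3, "agreement": 2}

shares = {name: coverage(n, TOTAL_PAPERS) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share} ({counts[name]}/{TOTAL_PAPERS})")
```

Recomputing shares this way is a quick check that a reported percentage and its count agree before using the cohort for comparisons.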

Metric Interpretation

accuracy is reported in 17.9% of hub papers (5/28); compare with a secondary metric before ranking methods.
cost is reported in 10.7% of hub papers (3/28); compare with a secondary metric before ranking methods.
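One possible way to compare against a secondary metric before ranking is to keep only the methods that no other method beats on both accuracy and cost (a Pareto filter). The sketch below is illustrative only: the method names and numbers are hypothetical placeholders, not results from any paper in this hub.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on both metrics and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep the results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Hypothetical placeholder values, chosen only to illustrate the filter.
results = [
    Result("method_a", accuracy=0.90, cost=12.0),
    Result("method_b", accuracy=0.88, cost=3.0),
    Result("method_c", accuracy=0.85, cost=5.0),
]

front = [r.name for r in pareto_front(results)]
print(front)
```

Here `method_c` drops out because `method_b` is both more accurate and cheaper, while `method_a` and `method_b` remain as a genuine accuracy-versus-cost trade-off that a single-metric ranking would hide.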

Researcher Checklist

Tighten coverage on Papers with explicit human feedback. Coverage is usable but incomplete (32.1% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (3.6% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (35.7% vs 35% target).
Tighten coverage on Papers with known rater population. Coverage is usable but incomplete (21.4% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (10.7% vs 35% target).
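The page does not state the rule behind the three checklist labels. The sketch below assumes one: "strong" at or above target, "usable but incomplete" at or above half the target, and "replication risk" otherwise. The 0.5 cut-off and the function name `coverage_status` are assumptions, chosen only because they reproduce the six labels shown above.

```python
def coverage_status(coverage_pct: float, target_pct: float, usable_ratio: float = 0.5) -> str:
    """Label coverage relative to a target. The usable_ratio cut-off is an assumption."""
    if coverage_pct >= target_pct:
        return "strong"
    if coverage_pct >= usable_ratio * target_pct:
        return "usable but incomplete"
    return "replication risk"


# (checklist item, coverage %, target %) as listed on this page.
checklist = [
    ("Papers with explicit human feedback", 32.1, 45.0),
    ("Papers reporting quality controls", 0.0, 30.0),
    ("Papers naming benchmarks/datasets", 3.6, 35.0),
    ("Papers naming evaluation metrics", 35.7, 35.0),
    ("Papers with known rater population", 21.4, 35.0),
    ("Papers with known annotation unit", 10.7, 35.0),
]

labels = [coverage_status(c, t) for _, c, t in checklist]

for (name, c, t), label in zip(checklist, labels):
    print(f"{name}: {label} ({c:g}% vs {t:g}% target)")
```

Other cut-offs between roughly 0.31 and 0.61 of the target would give the same labels for these six rows, so the exact threshold used by the page cannot be recovered from this data alone.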

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (21.4% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Benchmark Brief

Retrieval

Coverage: 1 paper (3.6%)

1 paper (3.6%) mentions Retrieval.

Examples: Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

Metric Brief

accuracy

Coverage: 5 papers (17.9%)

5 papers (17.9%) mention accuracy.

Examples: Modeling Distinct Human Interaction in Web Agents; What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data; How much does context affect the accuracy of AI health advice?

Metric Brief

cost

Coverage: 3 papers (10.7%)

3 papers (10.7%) mention cost.

Examples: When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models; SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery; A Scalable Framework for Evaluating Health Language Models

Metric Brief

agreement

Coverage: 2 papers (7.1%)

2 papers (7.1%) mention agreement.

Examples: Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters; A Scalable Framework for Evaluating Health Language Models

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks; When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models; Exploring Human-Machine Coexistence in Symmetrical Reality

Top Papers

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros · Feb 26, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources.

When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large Language Models (LLMs) are increasingly used to "professionalize" workplace communication, often at the cost of linguistic identity.

Exploring Human-Machine Coexistence in Symmetrical Reality
Zhenliang Zhang · Feb 25, 2026 · Citations: 0

Automatic Metrics

In the context of the evolution of artificial intelligence (AI), the interaction between humans and AI entities has become increasingly salient, challenging the conventional human-centric paradigms of human-machine interaction.

Evaluating the Usage of African-American Vernacular English in Large Language Models
Deja Dunlap, R. Thomas McCoy · Feb 25, 2026 · Citations: 0

Automatic Metrics

In AI, most evaluations of natural language understanding tasks are conducted in standardized dialects such as Standard American English (SAE).

SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0

Expert Verification · Automatic Metrics · Multi Agent

Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.

"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0

Expert Verification · Automatic Metrics

Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.

An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur · Feb 24, 2026 · Citations: 0

Expert Verification · Automatic Metrics

Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practic

PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly Q&A
Anna Martin-Boyle, Cara A. C. Leckey, Martha C. Brown, Harmanpreet Kaur · Feb 24, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature.

PuppetChat: Fostering Intimate Communication through Bidirectional Actions and Micronarratives
Emma Jiren Wang, Siying Hu, Zhicong Lu · Feb 23, 2026 · Citations: 0

Automatic Metrics

As a primary channel for sustaining modern intimate relationships, instant messaging facilitates frequent connection across distances.

Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026 · Citations: 0

Pairwise Preference · Automatic Metrics · Long Horizon

When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.

Games That Teach, Chats That Convince: Comparing Interactive and Static Formats for Persuasive Learning
Seyed Hossein Alavi, Zining Wang, Shruthi Chockkalingam, Raymond T. Ng, Vered Shwartz · Feb 20, 2026 · Citations: 0

Automatic Metrics

Interactive systems such as chatbots and games are increasingly used to persuade and educate on sustainability-related topics, yet it remains unclear how different delivery formats shape learning and persuasive outcomes when content is held

Mind the Style: Impact of Communication Style on Human-Chatbot Interaction
Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver · Feb 19, 2026 · Citations: 0

Automatic Metrics · Web Browsing

Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear.

Modeling Distinct Human Interaction in Web Agents
Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026 · Citations: 0

Pairwise Preference · Automatic Metrics · Web Browsing

Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold.

What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data
Dimitri Staufer, Kirsten Morehouse · Feb 19, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions.

Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers
Nusrat Jahan Lia, Shubhashis Roy Dipta · Feb 19, 2026 · Citations: 0

Automatic Metrics

The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior.

Surgical Activation Steering via Generative Causal Mediation
Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell · Feb 17, 2026 · Citations: 0

Automatic Metrics

Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response?

Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye · Oct 29, 2025 · Citations: 0

Automatic Metrics

Large language models (LLMs) are increasingly used as raters for evaluation tasks.

Designing and Evaluating Chain-of-Hints for Scientific Question Answering
Anubhav Jangra, Smaranda Muresan · Oct 24, 2025 · Citations: 0

Pairwise Preference · Automatic Metrics

Using the best performing LLM as the backbone of a quantitative study with 41 participants, we uncover distinct user preferences across hinting strategies, and identify the limitations of automatic evaluation metrics to capture them.

Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
Shramay Palta, Peter Rankel, Sarah Wiegreffe, Rachel Rudinger · Oct 9, 2025 · Citations: 0

Automatic Metrics

We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by L

ClearFairy: Capturing Creative Workflows through Decision Structuring, In-Situ Questioning, and Rationale Inference
Kihoon Son, DaEun Choi, Tae Soo Kim, Young-Ho Kim, Sangdoo Yun · Sep 18, 2025 · Citations: 0

Critique Edit · Automatic Metrics

Furthermore, exploratory applications demonstrate that captured steps can enhance generative AI agents in Figma, yielding predictions better aligned with professionals and producing coherent outcomes.

The AI Memory Gap: Users Misremember What They Created With AI or Without
Tim Zindulka, Sven Goller, Daniela Fernandes, Robin Welsch, Daniel Buschek · Sep 15, 2025 · Citations: 0

Automatic Metrics

Our findings reveal a significant gap in memory: After AI use, the odds of correct attribution dropped, with the steepest decline in mixed human-AI workflows, where either the idea or elaboration was created with AI.

When Algorithms Meet Artists: Semantic Compression of Artists' Concerns in the Public AI-Art Debate
Ariya Mukherjee-Gandhi, Oliver Muellerklein · Aug 5, 2025 · Citations: 0

Automatic Metrics

Artists occupy a paradoxical position in generative AI: their work trains the models reshaping creative labor.

Toward Beginner-Friendly LLMs for Language Learning: Controlling Difficulty in Conversation
Meiqing Jin, Liam Dugan, Chris Callison-Burch · Jun 4, 2025 · Citations: 0

Automatic Metrics

We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments.

Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0

Pairwise Preference · Automatic Metrics · Multi Agent

These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.

How much does context affect the accuracy of AI health advice?
Prashant Garg, Thiemo Fetzer · Apr 25, 2025 · Citations: 0

Automatic Metrics

English-language performance does not reliably generalise across contexts, underscoring the need for multilingual, domain-specific evaluation before deployment in public-health communication.

A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025 · Citations: 0

Rubric Rating · Expert Verification · Automatic Metrics

As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.

HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen · Mar 3, 2025 · Citations: 0

Automatic Metrics

A response mixed of factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on.

Usability Study of Security Features in Programmable Logic Controllers
Karen Li, Kopo M. Ramokapane, Awais Rashid · Aug 4, 2022 · Citations: 0

Automatic Metrics · Web Browsing

Our results uncover various misperceptions about the security controls and how design constraints, e.g., safety and lack of regular updates due to the long-term nature of such systems, provide significant challenges to the realization of mo
