HFEPX Hub

Human Eval Papers (Last 90 Days)

Updated from current HFEPX corpus (Apr 17, 2026). 60 papers are grouped in this hub page.

Common evaluation modes: Human Eval, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: AIME. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 22, 2026.

Papers: 60 · Last published: Mar 22, 2026

Human Eval · Last 90d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0% (60 / 60 sampled papers carry no low-signal flag).

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

4 papers are replication-ready (benchmark + metric + explicit evaluation mode).
4 papers support judge-vs-human agreement analysis.
7 papers report explicit quality controls (calibration/adjudication/IAA).
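
The two filters above can be sketched in code. This is a minimal illustration, assuming each paper is a record with hypothetical fields `eval_modes`, `benchmarks`, and `metrics`; the `Paper` class and its field names are not part of HFEPX, and the three records are transcribed from rows of the protocol matrix on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record layout; the real HFEPX schema is not shown on this page.
    title: str
    eval_modes: set[str] = field(default_factory=set)
    benchmarks: list[str] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)


def is_replication_ready(p: Paper) -> bool:
    """Benchmark + metric + explicit evaluation mode are all present."""
    return bool(p.benchmarks) and bool(p.metrics) and bool(p.eval_modes)


def supports_judge_human_comparison(p: Paper) -> bool:
    """Paper reports both human_eval and llm_as_judge."""
    return {"human_eval", "llm_as_judge"} <= p.eval_modes


papers = [
    Paper("AgentHER", {"human_eval", "llm_as_judge"},
          ["WebArena", "ToolBench"], ["precision", "pass@1"]),
    Paper("Is this Idea Novel?", {"human_eval"}, ["Rinobench"], []),
    Paper("Grounding Arabic LLMs in the Doha Historical Dictionary",
          {"human_eval", "llm_as_judge"}, [], ["accuracy", "kappa"]),
]

replication_ready = [p.title for p in papers if is_replication_ready(p)]
judge_comparable = [p.title for p in papers if supports_judge_human_comparison(p)]
```

With these three rows, only AgentHER passes the replication-ready filter, while AgentHER and the Doha Historical Dictionary paper both qualify for judge-vs-human comparison.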

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Why This Matters For Eval Research

42.1% of papers report explicit human-feedback signals, led by pairwise preferences.
Human evaluation appears in 63.3% of papers in this hub.
AIME is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

1 sampled paper reports both human evaluation and LLM-as-judge, supporting direct agreement checks.
Most common quality-control signal is inter-annotator agreement reporting (6.7% of papers).
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.

Benchmark Interpretation

AIME appears in 2.6% of hub papers (1/60); use this cohort for benchmark-matched comparisons.
Correctbench appears in 2.6% of hub papers (1/60); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 28.9% of hub papers (11/60); compare with a secondary metric before ranking methods.
Agreement is reported in 15.8% of hub papers (6/60); compare with a secondary metric before ranking methods.
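
Because the denominator changes how these shares read, it can help to recompute them from the raw counts. The sketch below is plain arithmetic: 60 is the hub size and 38 is the Human Eval count from the Evaluation Modes list on this page. Which denominator the hub applies to each figure is an assumption here, since the page does not state it, and `share` is a hypothetical helper.

```python
def share(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


HUB_PAPERS = 60         # papers grouped in this hub
HUMAN_EVAL_PAPERS = 38  # "Human Eval (38)" in the Evaluation Modes list

accuracy_over_hub = share(11, HUB_PAPERS)
accuracy_over_human_eval = share(11, HUMAN_EVAL_PAPERS)
human_eval_over_hub = share(HUMAN_EVAL_PAPERS, HUB_PAPERS)
```

Under these assumptions, 11 of 38 gives 28.9% and 38 of 60 gives 63.3%, while 11 of 60 gives 18.3%, so it is worth confirming the base before comparing shares across sections.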

Researcher Checklist

Moderate: Papers with explicit human feedback
Coverage is usable but incomplete (42.1% vs 45% target).

Moderate: Papers reporting quality controls
Coverage is usable but incomplete (18.4% vs 30% target).

Moderate: Papers naming benchmarks/datasets
Coverage is usable but incomplete (23.7% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (71.1% vs 35% target).

Strong: Papers with known rater population
Coverage is strong (36.8% vs 35% target).

Moderate: Papers with known annotation unit
Coverage is usable but incomplete (28.9% vs 35% target).
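
The checklist bands can be reproduced with a simple threshold comparison. This sketch assumes that "Strong" means coverage at or above the target and "Moderate" means anything below it; the page does not state the exact cutoffs or whether a lower band exists, so `band` is a hypothetical helper and the rule is an assumption.

```python
# (coverage %, target %) pairs copied from the checklist above.
CHECKLIST = {
    "explicit human feedback": (42.1, 45.0),
    "quality controls": (18.4, 30.0),
    "benchmarks/datasets": (23.7, 35.0),
    "evaluation metrics": (71.1, 35.0),
    "known rater population": (36.8, 35.0),
    "known annotation unit": (28.9, 35.0),
}


def band(coverage: float, target: float) -> str:
    # Assumed rule: at or above target is "Strong", otherwise "Moderate".
    return "Strong" if coverage >= target else "Moderate"


bands = {name: band(cov, tgt) for name, (cov, tgt) in CHECKLIST.items()}

# Shortfall against target in percentage points, largest gap first.
gaps = sorted(
    ((round(tgt - cov, 1), name)
     for name, (cov, tgt) in CHECKLIST.items() if cov < tgt),
    reverse=True,
)
```

Sorting by shortfall puts quality controls first, which is consistent with the gap called out under Known Gaps.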

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 18.4% of papers report quality controls; prioritize calibration/adjudication evidence.

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (AIME vs Correctbench) before comparing methods.
Track metric sensitivity by reporting both accuracy and agreement.
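
The first analysis above needs an agreement statistic. One common choice is Cohen's kappa; the sketch below computes it from scratch for two label sequences. The label lists are made-up illustrative data, not results from any paper in this hub, and `cohen_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    human_freq, judge_freq = Counter(human), Counter(judge)
    p_expected = sum(human_freq[k] * judge_freq[k] for k in human_freq) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Illustrative labels only (not taken from any paper in this hub).
human_labels = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
judge_labels = ["good", "good", "bad", "bad", "bad", "good", "good", "bad"]
kappa = cohen_kappa(human_labels, judge_labels)
```

Computing kappa separately for each benchmark cohort keeps the stratification suggested above, and reporting it next to raw accuracy covers the metric-sensitivity point.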

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: AIME
Metric Slice: accuracy
IAA-Reported Evaluations
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabe…

Highest protocol score with explicit human/eval signal plus WebArena.

Strongest benchmark reference

Personalized RewardBench: Evaluating Reward Models with Human Aligned…

Rewardbench with accuracy gives a fast comparison anchor.

Strongest recent paper

Is this Idea Novel? An Automated Benchmark for Judgment of Research I…

Useful for current practice scanning; published Mar 11, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Mar 22, 2026 · Citations: 0 · Score: 10.0
HF: Demonstrations · Eval: Human Eval, Llm As Judge · Benchmark: WebArena · Metric: Precision

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Apr 8, 2026 · Citations: 0 · Score: 8.0
HF: Pairwise Preference, Rubric Rating · Eval: Human Eval, Automatic Metrics · Benchmark: Rewardbench · Metric: Accuracy

Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas
Mar 11, 2026 · Citations: 0 · Score: 7.5
HF: Rubric Rating · Eval: Human Eval · Benchmark: Rinobench · Metric: Not Reported

LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Mar 31, 2026 · Citations: 0 · Score: 7.5
HF: Rubric Rating · Eval: Human Eval · Benchmark: Not Reported · Metric: Kappa

Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
Mar 25, 2026 · Citations: 0 · Score: 7.5
HF: Not Reported · Eval: Human Eval, Llm As Judge · Benchmark: Not Reported · Metric: Accuracy

Validating Political Position Predictions of Arguments
Feb 20, 2026 · Citations: 0 · Score: 7.0
HF: Pairwise Preference · Eval: Human Eval · Benchmark: Not Reported · Metric: Agreement

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling (Mar 22, 2026)	Yes: Demonstrations	Human Eval, Llm As Judge	WebArena, ToolBench	Precision, Pass@1	Not Reported
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization (Apr 8, 2026)	Yes: Pairwise Preference, Rubric Rating	Human Eval, Automatic Metrics	Rewardbench	Accuracy, Helpfulness	Not Reported
Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas (Mar 11, 2026)	Yes: Rubric Rating	Human Eval	Rinobench	Not Reported	Gold Questions
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias (Mar 31, 2026)	Yes: Rubric Rating	Human Eval	Not Reported	Kappa, Agreement	Inter Annotator Agreement Reported, Adjudication
Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith (Mar 25, 2026)	No: Not Reported	Human Eval, Llm As Judge	Not Reported	Accuracy, Kappa	Inter Annotator Agreement Reported
Validating Political Position Predictions of Arguments (Feb 20, 2026)	Yes: Pairwise Preference	Human Eval	Not Reported	Agreement	Gold Questions, Inter Annotator Agreement Reported
A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations (Mar 26, 2026)	Yes: Expert Verification	Human Eval	Cpgbench	Not Reported	Not Reported
CounselReflect: A Toolkit for Auditing Mental-Health Dialogues (Mar 31, 2026)	Yes: Rubric Rating, Expert Verification	Human Eval	Not Reported	Not Reported	Adjudication
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning (Mar 29, 2026)	Yes: Expert Verification	Human Eval, Automatic Metrics	Not Reported	Accuracy	Not Reported
IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR (Jan 23, 2026)	Yes: Pairwise Preference, Expert Verification	Human Eval	Writingbench	Not Reported	Not Reported
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models (Feb 21, 2026)	Yes: Pairwise Preference	Human Eval	GSM8K, AIME	Not Reported	Not Reported
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind (Jan 22, 2026)	Yes: Pairwise Preference, Critique Edit	Human Eval	Rebuttalbench	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AgentHER: Hindsight Experience Replay for LLM Agent…	Personalized RewardBench: Evaluating Reward Models…	Is this Idea Novel? An Automated Benchmark for Judg…
Human Feedback	Demonstrations	Pairwise Preference, Rubric Rating	Rubric Rating
Evaluation Modes	Human Eval, Llm As Judge	Human Eval, Automatic Metrics	Human Eval
Benchmarks	WebArena, ToolBench	Rewardbench	Rinobench
Metrics	Precision, Pass@1	Accuracy, Helpfulness	Not reported
Quality Controls	Not reported	Not reported	Gold Questions
Rater Population	Unknown	Unknown	Domain Experts
Annotation Unit	Trajectory	Pairwise	Multi Dim Rubric

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (8)
Rubric Rating (6)
Expert Verification (4)
Critique Edit (2)

Evaluation Modes

Human Eval (38)
Automatic Metrics (17)
Llm As Judge (4)
Simulation Env (3)

Top Benchmarks

AIME (1)
Correctbench (1)
Cpgbench (1)
Cruxeval (1)

Top Metrics

Accuracy (11)
Agreement (6)
F1 (4)
Bleu (3)

Rater Population Mix

Domain Experts (13)
Mixed (1)

Quality Controls

Inter Annotator Agreement Reported (4)
Adjudication (3)
Gold Questions (2)

Coverage diagnostics (sample-based): human-feedback 26.7% · benchmarks 15.0% · metrics 58.3% · quality controls 11.7%.

Top Papers

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0

Demonstrations · Human Eval · Llm As Judge · Long Horizon

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng · Mar 31, 2026 · Citations: 0

Rubric Rating · Expert Verification · Human Eval · Web Browsing

The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas
Tim Schopf, Michael Färber · Mar 11, 2026 · Citations: 0

Rubric Rating · Human Eval

To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments.
Validating Political Position Predictions of Arguments
Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026 · Citations: 0

Pairwise Preference · Human Eval

Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026 · Citations: 0

Rubric Rating · Human Eval

We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0

Pairwise Preference · Rubric Rating · Human Eval · Automatic Metrics

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner · Mar 29, 2026 · Citations: 0

Expert Verification · Human Eval · Automatic Metrics · Multi Agent

In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring
Jonas Kubesch, Lena Huber, Clemens Havas · Mar 6, 2026 · Citations: 0

Rubric Rating · Human Eval

This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation.
A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations
Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu · Mar 26, 2026 · Citations: 0

Expert Verification · Human Eval

To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations.
IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR
Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun · Jan 23, 2026 · Citations: 0

Pairwise Preference · Expert Verification · Human Eval

Peer review relies on substantive, evidence-based questions, yet current LLMs generate surface-level queries that perform worse than human reviewer questions in expert evaluation.
PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations
Vittoria Vineis, Matteo Silvestri, Lorenzo Antonelli, Filippo Betello, Gabriele Tolomei · Mar 6, 2026 · Citations: 0

Pairwise Preference · Human Eval

To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives.
VRM: Teaching Reward Models to Understand Authentic Human Preferences
Biao Liu, Ning Xu, Junming Yang, Hao Xu, Xin Geng · Mar 5, 2026 · Citations: 0

Pairwise Preference · Human Eval

Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on…
Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing
Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul · Mar 6, 2026 · Citations: 0

Human Eval · Automatic Metrics · Long Horizon

We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis…
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0

Pairwise Preference · Human Eval

We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight…
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Zhitao He, Zongwei Lyu, Yi R Fung · Jan 22, 2026 · Citations: 0

Pairwise Preference · Critique Edit · Human Eval

In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion…
FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health
Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026 · Citations: 0

Human Eval · Simulation Env · Long Horizon

Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena · Apr 7, 2026 · Citations: 0

Human Eval · Long Horizon

Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis.
Discovering Implicit Large Language Model Alignment Objectives
Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026 · Citations: 0

Rubric Rating · Human Eval

To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery
Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo · Mar 9, 2026 · Citations: 0

Human Eval · Multi Agent

To address this, we introduce EvoScientist, an evolving multi-agent AI scientist framework that continuously improves research strategies through persistent memory and self-evolution.
Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback
Chenyang Zhao, Vinny Cahill, Ivana Dusparic · Feb 24, 2026 · Citations: 0

Pairwise Preference · Rlaif Or Synthetic Feedback · Human Eval

Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioural outcomes.
Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho · Apr 6, 2026 · Citations: 0

Human Eval · Automatic Metrics

However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation.
Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
Somaya Eltanbouly, Samer Rashwani · Mar 25, 2026 · Citations: 0

Human Eval · Llm As Judge

Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments.
Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis
Yushen Chen, Junzhe Liu, Yujie Tu, Zhikang Niu, Yuzhe Liang · Jan 20, 2026 · Citations: 0

Human Eval · Long Horizon

Key barriers include substantial cross-dialect lexical and phonological divergence, scarce synthesis-grade data, and the absence of a standardized multi-dialect evaluation benchmark.
Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization
Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon · Mar 31, 2026 · Citations: 0

Human Eval · Automatic Metrics

Through controlled ablations on MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance.
Learning to Predict Future-Aligned Research Proposals with Language Models
Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu · Mar 28, 2026 · Citations: 0

Human Eval · Automatic Metrics

Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality.
Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen
James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky · Feb 27, 2026 · Citations: 0

Human Eval · Automatic Metrics

This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose.
CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models
Anqi Li, Chenxiao Wang, Yu Lu, Renjun Xu, Lizhi Ma · Feb 24, 2026 · Citations: 0

Human Eval · Automatic Metrics

Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings.
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0

Human Eval · Automatic Metrics

Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Minzhu Tu, Shiyu Ni, Keping Bi · Apr 8, 2026 · Citations: 0

Human Eval · Automatic Metrics

Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases.
Voxtral TTS
Mistral-AI, Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg · Mar 26, 2026 · Citations: 0

Human Eval · Automatic Metrics

In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5.
Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties
Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi · Mar 26, 2026 · Citations: 0

Human Eval · Automatic Metrics

A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech
Nicolás Benjamín Ocampo, Tommaso Caselli, Davide Ceolin · Mar 26, 2026 · Citations: 0

Human Eval · Automatic Metrics

We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data.
Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media
Thi Huyen Nguyen, Koustav Rudra, Wolfgang Nejdl · Mar 19, 2026 · Citations: 0

Human Eval · Automatic Metrics

Experiments are conducted over CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales.
Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026 · Citations: 0

Human Eval · Automatic Metrics

Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84).
Claim Automation using Large Language Model
Zhengda Mo, Zhiyu Quan, Eli O'Donohue, Kaiwen Zhong · Feb 18, 2026 · Citations: 0

Human Eval · Automatic Metrics

We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy.
Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation
Lakshan Cooray, Deshan Sumanathilaka, Pattigadapa Venkatesh Raju · Jan 31, 2026 · Citations: 0

Human Eval · Llm As Judge

Nine instruction-tuned low-parameterized SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods.
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
Xuanbo Su, Wenhao Hu, Haibo Su, Yunzhang Chen, Le Zhan · Apr 8, 2026 · Citations: 0

Human Eval · Simulation Env

We introduce SalesLLM benchmark, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with…
AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization
Fahmida Liza Piya, Rahmatollah Beheshti · Feb 23, 2026 · Citations: 0

Human Eval · Llm As Judge

We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and targeted correction to reduce hallucinated content.
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
Gabriel Stefan, Adrian-Marius Dumitran · Apr 9, 2026 · Citations: 0

Human Eval

We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation.
STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems
Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu · Apr 8, 2026 · Citations: 0

Human Eval

To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with…
PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation
Yanxin Luo, Xiaoyu Zhang, Jing Li, Yan Gao, Donghong Han · Apr 2, 2026 · Citations: 0

Human Eval

Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations.
Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
HyunJoon Jung, William Na · Apr 1, 2026 · Citations: 0

Human Eval

LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed?
ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection
Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga · Mar 31, 2026 · Citations: 0

Human Eval

Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.
Open Machine Translation for Esperanto
Ona de Gibert, Lluís de Gibert · Mar 31, 2026 · Citations: 0

Human Eval

In this work, we present the first comprehensive evaluation of open-source MT systems for Esperanto, comparing rule-based systems, encoder-decoder models, and LLMs across model sizes.
Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors
Cole Walsh, Rodica Ivan · Mar 26, 2026 · Citations: 0

Human Eval

These systems commonly achieve performance levels comparable or superior to trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that…
LLMs Do Not Grade Essays Like Humans
Jerin George Mathew, Sumayya Taher, Anindita Kundu, Denilson Barbosa · Mar 24, 2026 · Citations: 0

Human Eval

Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear.
Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation
Hanwen Shen, Ting Ying, Jiajie Lu, Shanshan Wang · Mar 14, 2026 · Citations: 0

Human Eval

Across multiple benchmarks and human evaluations, CAP-TTA effectively reduces toxicity/bias score with significantly lower latency than standard optimization methods (e.g., AdamW or SGD).
Enhancing Debunking Effectiveness through LLM-based Personality Adaptation
Pietro Dell'Oglio, Alessandro Bondielli, Francesco Marcelloni, Lucia C. Passaro · Mar 10, 2026 · Citations: 0

Human Eval

To assess the effectiveness of these transformations, we employ a separate LLM as an automated evaluator simulating corresponding personality traits, thereby eliminating the need for costly human evaluation panels.
Evaluating LLM-Based Grant Proposal Review via Structured Perturbations
William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana Maynard · Mar 9, 2026 · Citations: 0

Human Eval

As AI-assisted grant proposals outpace manual review capacity in a kind of “Malthusian trap” for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation.
TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation
Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalniņš, Dāvis Nicmanis, Jeļizaveta Jeļinska · Mar 9, 2026 · Citations: 0

Human Eval

Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages.
Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data
Thanathai Lertpetchpun, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd · Mar 8, 2026 · Citations: 0

Human Eval

Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks
Anca Dinu, Andreiana Mihail, Andra-Maria Florescu, Claudiu Creanga · Mar 6, 2026 · Citations: 0

Human Eval

The analysis combines human evaluation with computational methods aimed at detecting visual and stylistic similarities or divergences between the original works and their AI-produced renditions.
TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
Christian Greisinger, Steffen Eger · Mar 3, 2026 · Citations: 0

Human Eval

Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at…
When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation
Thibault Prouteau, Francis Lareau, Nicolas Dugué, Jean-Charles Lamirel, Christophe Malaterre · Mar 2, 2026 · Citations: 0

Human Eval

Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment.
TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models
Reihaneh Iranmanesh, Saeedeh Davoudi, Pasha Abrishamchian, Ophir Frieder, Nazli Goharian · Feb 26, 2026 · Citations: 0

Human Eval

This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian.
Improving Implicit Discourse Relation Recognition with Natural Language Explanations from LLMs
Heng Wang, Changxing Wu · Feb 25, 2026 · Citations: 0

Human Eval

Experimental results on PDTB demonstrate that our approach significantly improves IDRR performance, while human evaluation further confirms that the generated explanations enhance model interpretability.
Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
Nora Petrova, John Burden · Feb 24, 2026 · Citations: 0

Human Eval

While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking.
Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems
Zhangqi Duan, Arnav Kankaria, Dhruv Kartik, Andrew Lan · Feb 19, 2026 · Citations: 0

Human Eval

Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR
Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, Nazmul Siddique · Feb 16, 2026 · Citations: 0

Human Eval

Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity.
Do LLMs Truly Benefit from Longer Context in Automatic Post-Editing?
Ahrii Kim, Seong-heum Kim · Jan 27, 2026 · Citations: 0

Human Eval

Our results show that proprietary LLMs achieve near human-level APE quality even with simple one-shot prompting, regardless of whether document context is provided.
