HFEPX Hub

Critique Edit Papers (Last 60 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 15 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Multi-Dimensional Rubric. Frequent quality control: Adjudication. Frequently cited benchmark: ContentBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 15, 2026.

Papers: 15 · Last published: Feb 15, 2026

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage: 100.0% (15 / 15 sampled papers are not low-signal flagged).
Replication-Ready Set: 2 papers (benchmark + metric + explicit evaluation mode).
Judge/Human Comparability: 0 papers contain both `human_eval` and `llm_as_judge`, so none support judge-vs-human agreement analysis.
Quality Controls: 2 papers report explicit quality controls (calibration/adjudication/IAA).
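
The triage counts above can be reproduced with a small filter over per-paper protocol tags. The following is a minimal sketch, not the hub's actual pipeline: the `Paper` record, the lowercase tag strings, and the shortened titles are assumptions made for illustration, while the tag values are taken from the Protocol Matrix on this page (only seven of its rows are included).

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record type; field names are illustrative, not the hub's schema.
    title: str
    eval_modes: list = field(default_factory=list)
    benchmarks: list = field(default_factory=list)
    metrics: list = field(default_factory=list)
    quality_controls: list = field(default_factory=list)


def is_replication_ready(p: Paper) -> bool:
    # Benchmark + metric + explicit evaluation mode must all be present.
    return bool(p.benchmarks and p.metrics and p.eval_modes)


def supports_judge_vs_human(p: Paper) -> bool:
    # Requires both a human evaluation and an LLM-as-judge evaluation.
    return "human_eval" in p.eval_modes and "llm_as_judge" in p.eval_modes


# Seven rows copied from the Protocol Matrix (titles shortened).
papers = [
    Paper("HLE-Verified", ["automatic_metrics"], ["HLE"], ["accuracy"], ["adjudication"]),
    Paper("ContentBench", ["automatic_metrics"], ["ContentBench"], ["agreement", "cost"]),
    Paper("RebuttalAgent", ["human_eval"], ["Rebuttalbench"]),
    Paper("From Pixels to Policies", ["simulation_env"], [], ["latency"]),
    Paper("CAMEL", ["automatic_metrics"], [], ["accuracy", "cost"]),
    Paper("VULCA-Bench", [], ["Vulca Bench"]),
    Paper("Cross-Cultural Art Critique", [], [], [], ["calibration"]),
]

replication_ready = [p.title for p in papers if is_replication_ready(p)]
judge_vs_human = [p.title for p in papers if supports_judge_vs_human(p)]
with_quality_controls = [p.title for p in papers if p.quality_controls]

print(replication_ready)
print(judge_vs_human)
print(with_quality_controls)
```

On these rows the filter selects HLE-Verified and ContentBench as replication-ready, finds no paper with both evaluation modes, and flags two papers with quality controls, which matches the counts listed above.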

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by critique/edit feedback.
Automatic metrics appear in 33.3% of papers in this hub (5/15).
ContentBench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Quality-control signals are sparse: adjudication and calibration each appear in 1 of 15 papers (6.7%).
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Benchmark Interpretation

ContentBench appears in 6.7% of hub papers (1/15); use this cohort for benchmark-matched comparisons.
HLE appears in 6.7% of hub papers (1/15); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 13.3% of hub papers (2/15); compare with a secondary metric before ranking methods.
Cost is reported in 13.3% of hub papers (2/15); compare with a secondary metric before ranking methods.
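
One way to compare a primary metric against a secondary one before ranking is to keep only the methods that no other method beats on both. The sketch below assumes accuracy (higher is better) and cost (lower is better); the method names and numbers are hypothetical placeholders, not results from any paper in this hub.

```python
def pareto_front(methods):
    """Return names of methods not dominated on (accuracy, cost).

    `methods` maps a name to an (accuracy, cost) pair. A method is dominated
    when another method is at least as good on both metrics and strictly
    better on at least one.
    """
    front = []
    for name, (acc, cost) in methods.items():
        dominated = any(
            o_acc >= acc and o_cost <= cost and (o_acc > acc or o_cost < cost)
            for other, (o_acc, o_cost) in methods.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical values for illustration only.
methods = {
    "method_a": (0.82, 4.0),
    "method_b": (0.80, 1.5),
    "method_c": (0.78, 2.0),
    "method_d": (0.85, 9.0),
}

front = pareto_front(methods)
print(front)  # ['method_a', 'method_b', 'method_d']
```

Ranking by accuracy alone would put `method_d` first and hide that `method_b` is nearly as accurate at a fraction of the cost; `method_c` drops out because `method_b` is both cheaper and more accurate.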

Researcher Checklist

Strong: Papers with explicit human feedback. Coverage is strong (100% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (13.3% vs 30% target).
Moderate: Papers naming benchmarks/datasets. Coverage is usable but incomplete (26.7% vs 35% target).
Moderate: Papers naming evaluation metrics. Coverage is usable but incomplete (26.7% vs 35% target).
Gap: Papers with known rater population. Coverage is a replication risk (20% vs 35% target).
Moderate: Papers with known annotation unit. Coverage is usable but incomplete (26.7% vs 35% target).
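
The checklist labels can be approximated by comparing each coverage percentage with its target. This is a sketch under stated assumptions: the page does not document how bands are assigned, so the `moderate_ratio` cutoff of 0.6 is a guess that merely reproduces the six labels above, and the annotation-unit count of 4 is inferred from the 26.7% figure.

```python
def coverage_band(count, total, target_pct, moderate_ratio=0.6):
    """Return (coverage percent, band) for `count` of `total` papers.

    Assumed rule (not documented by the hub): "Strong" when coverage meets
    the target, "Moderate" when it reaches `moderate_ratio` of the target,
    "Gap" otherwise.
    """
    coverage_pct = 100.0 * count / total
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        band = "Strong"
    elif ratio >= moderate_ratio:
        band = "Moderate"
    else:
        band = "Gap"
    return round(coverage_pct, 1), band


# (signal, papers with the signal, target %) for the 15 sampled papers.
checklist = [
    ("explicit human feedback", 15, 45),
    ("quality controls", 2, 30),
    ("benchmarks/datasets", 4, 35),
    ("evaluation metrics", 4, 35),
    ("known rater population", 3, 35),
    ("known annotation unit", 4, 35),  # count inferred from 26.7%
]

results = {name: coverage_band(count, 15, target) for name, count, target in checklist}
for name, (pct, band) in results.items():
    print(f"{band}: {name} ({pct}% vs {target if False else ''}".rstrip(" (") if False else f"{band}: {name} ({pct}%)")
```

Any cutoff between roughly 0.58 and 0.76 would give the same labels on this data, so the exact value should not be read as the hub's rule.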

Strengths

Strong human-feedback signal (100% of papers).

Known Gaps

Only 13.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (20% coverage).

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (ContentBench vs HLE) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.
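
For the inter-annotator agreement check, Cohen's kappa is one common choice when two raters label the same items. The sketch below is a plain-Python implementation for illustration; the rater labels are hypothetical, and none of the papers in this hub is claimed to use this exact statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label everywhere; kappa is undefined, report 1.0.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two raters on eight critique/edit items.
rater_a = ["accept", "accept", "revise", "revise", "accept", "revise", "accept", "revise"]
rater_b = ["accept", "revise", "revise", "revise", "accept", "revise", "accept", "accept"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))  # 0.5
```

Here the raters agree on 6 of 8 items (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5. For more than two raters or ordinal rubrics, a different statistic (for example Fleiss' kappa or Krippendorff's alpha) would be the usual substitute.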

Recommended Queries

Human Eval Protocols
Benchmark Slice: ContentBench
Metric Slice: accuracy
Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Feb 15, 2026 · Citations: 0 · Score: 9.5
HF: Expert Verification, Critique Edit · Eval: Automatic Metrics · Benchmark: HLE · Metric: Accuracy

Can Large Language Models Replace Human Coders? Introducing ContentBench
Feb 23, 2026 · Citations: 0 · Score: 8.0
HF: Critique Edit · Eval: Automatic Metrics · Benchmark: ContentBench · Metric: Agreement

RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Jan 22, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference, Critique Edit · Eval: Human Eval · Benchmark: Rebuttalbench · Metric: Not Reported

From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Feb 14, 2026 · Citations: 0 · Score: 6.0
HF: Critique Edit · Eval: Simulation Env · Benchmark: Not Reported · Metric: Latency

CAMEL: Confidence-Gated Reflection for Reward Modeling
Feb 24, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference, Critique Edit · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding
Jan 12, 2026 · Citations: 0 · Score: 6.0
HF: Critique Edit · Eval: Not Reported · Benchmark: Vulca Bench · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam (Feb 15, 2026)	Yes: Expert Verification, Critique Edit	Automatic Metrics	HLE	Accuracy	Adjudication
Can Large Language Models Replace Human Coders? Introducing ContentBench (Feb 23, 2026)	Yes: Critique Edit	Automatic Metrics	ContentBench	Agreement, Cost	Not Reported
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind (Jan 22, 2026)	Yes: Pairwise Preference, Critique Edit	Human Eval	Rebuttalbench	Not Reported	Not Reported
From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design (Feb 14, 2026)	Yes: Critique Edit	Simulation Env	Not Reported	Latency	Not Reported
CAMEL: Confidence-Gated Reflection for Reward Modeling (Feb 24, 2026)	Yes: Pairwise Preference, Critique Edit	Automatic Metrics	Not Reported	Accuracy, Cost	Not Reported
VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding (Jan 12, 2026)	Yes: Critique Edit	Not Reported	Vulca Bench	Not Reported	Not Reported
Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models (Jan 12, 2026)	Yes: Rubric Rating, Critique Edit	Not Reported	Not Reported	Not Reported	Calibration
Unlocking Reasoning Capability on Machine Translation in Large Language Models (Feb 16, 2026)	Yes: Critique Edit	Not Reported	Not Reported	Not Reported	Not Reported
Towards Better RL Training Data Utilization via Second-Order Rollout (Feb 26, 2026)	Yes: Critique Edit	Not Reported	Not Reported	Not Reported	Not Reported
Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information (Feb 25, 2026)	Yes: Critique Edit	Automatic Metrics	Not Reported	Not Reported	Not Reported
Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI (Feb 9, 2026)	Yes: Critique Edit	Automatic Metrics	Not Reported	Not Reported	Not Reported
The logic of KM belief update is contained in the logic of AGM belief revision (Feb 26, 2026)	Yes: Critique Edit	Not Reported	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	HLE-Verified: A Systematic Verification and Structu…	Can Large Language Models Replace Human Coders? Int…	RebuttalAgent: Strategic Persuasion in Academic Reb…
Human Feedback	Expert Verification, Critique Edit	Critique Edit	Pairwise Preference, Critique Edit
Evaluation Modes	Automatic Metrics	Automatic Metrics	Human Eval
Benchmarks	HLE	ContentBench	Rebuttalbench
Metrics	Accuracy	Agreement, Cost	Not reported
Quality Controls	Adjudication	Not reported	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Unknown	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Critique Edit (15)
Pairwise Preference (3)
Expert Verification (1)
Rubric Rating (1)

Evaluation Modes

Automatic Metrics (5)
Human Eval (1)
Simulation Env (1)

Top Benchmarks

ContentBench (1)
HLE (1)
Rebuttalbench (1)
Vulca Bench (1)

Top Metrics

Accuracy (2)
Cost (2)
Agreement (1)
Coherence (1)

Rater Population Mix

Domain Experts (3)

Quality Controls

Adjudication (1)
Calibration (1)

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 26.7% · metrics 26.7% · quality controls 13.3%.

Top Papers

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0

Expert Verification, Critique Edit · Automatic Metrics

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.

RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Zhitao He, Zongwei Lyu, Yi R Fung · Jan 22, 2026 · Citations: 0

Pairwise Preference, Critique Edit · Human Eval

In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion…

From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026 · Citations: 0

Critique Edit · Simulation Env

We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design.

Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models
Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0

Rubric Rating, Critique Edit

Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.

Can Large Language Models Replace Human Coders? Introducing ContentBench
Michael Haman · Feb 23, 2026 · Citations: 0

Critique Edit · Automatic Metrics

This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.

CAMEL: Confidence-Gated Reflection for Reward Modeling
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026 · Citations: 0

Pairwise Preference, Critique Edit · Automatic Metrics

Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances.

VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding
Haorui Yu, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0

Critique Edit

We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception.

Unlocking Reasoning Capability on Machine Translation in Large Language Models
Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio · Feb 16, 2026 · Citations: 0

Critique Edit · Long Horizon

We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.

Towards Better RL Training Data Utilization via Second-Order Rollout
Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui · Feb 26, 2026 · Citations: 0

Critique Edit

Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple…

Reward Modeling from Natural Language Human Feedback
Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang · Jan 12, 2026 · Citations: 0

Pairwise Preference, Critique Edit

To address this issue, we propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals, thereby mitigating the problem of limited solution space inherent…

Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information
Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu · Feb 25, 2026 · Citations: 0

Critique Edit · Automatic Metrics

To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer.

Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI
Ziyan Wang, Longlong Ma · Feb 9, 2026 · Citations: 0

Critique Edit · Automatic Metrics

In Chomsky's provocative critique "The False Promise of CHATGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures like humans, there

The logic of KM belief update is contained in the logic of AGM belief revision
Giacomo Bonanno · Feb 26, 2026 · Citations: 0

Critique Edit

Denoting the latter by $\mathcal{L}_{AGM}$ and the former by $\mathcal{L}_{KM}$, we show that every axiom of $\mathcal{L}_{KM}$ is a theorem of $\mathcal{L}_{AGM}$.

Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift
Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim · Feb 26, 2026 · Citations: 0

Critique Edit

NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.

Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition
Varun Nathan, Shreyas Guha, Ayush Kumar · Feb 16, 2026 · Citations: 0

Critique Edit

We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools…
