HFEPX Hub

Critique Edit Papers (Last 45 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 12 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Scalar. Frequent quality control: Adjudication. Frequently cited benchmark: ContentBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 12 · Last published: Feb 26, 2026

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage: 100.0% (12 / 12 sampled papers are not low-signal flagged).
Replication-Ready Set: 2 papers have a benchmark, a metric, and an evaluation mode explicitly present.
Judge/Human Comparability: 0 papers contain both `human_eval` and `llm_as_judge`, so none support judge-vs-human agreement analysis.
Quality Controls: 1 paper reports explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by critique/edit feedback.
Automatic metrics appear in 41.7% of papers in this hub (5/12).
ContentBench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Most common quality-control signal is adjudication (8.3% of papers).
Rater context is mostly domain experts, and annotation is commonly scalar scoring; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Benchmark Interpretation

ContentBench appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.
HLE appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 16.7% of hub papers (2/12); compare with a secondary metric before ranking methods.
Cost is reported in 16.7% of hub papers (2/12); compare with a secondary metric before ranking methods.

Researcher Checklist

Strong: Papers with explicit human feedback. Coverage is strong (100% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (8.3% vs 30% target).
Moderate: Papers naming benchmarks/datasets. Coverage is usable but incomplete (25% vs 35% target).
Moderate: Papers naming evaluation metrics. Coverage is usable but incomplete (33.3% vs 35% target).
Gap: Papers with known rater population. Coverage is a replication risk (8.3% vs 35% target).
Gap: Papers with known annotation unit. Coverage is a replication risk (16.7% vs 35% target).
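
The checklist labels can be reproduced with a simple banding rule. The sketch below is an assumption, not the hub's documented scoring logic: it treats coverage at or above target as Strong, coverage at or above half the target as Moderate, and anything lower as Gap. The function name `coverage_band` and the 0.5 cutoff are hypothetical; the cutoff was chosen only because it is consistent with the six checklist rows above.

```python
def coverage_band(coverage_pct: float, target_pct: float) -> str:
    """Assign a checklist band from coverage and target (both in percent).

    Assumed rule (not confirmed by the source page):
    Strong if coverage >= target, Moderate if coverage >= half the target,
    Gap otherwise.
    """
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= 0.5 * target_pct:
        return "Moderate"
    return "Gap"


# (coverage %, target %) pairs copied from the checklist rows.
CHECKLIST = {
    "explicit human feedback": (100.0, 45.0),
    "quality controls": (8.3, 30.0),
    "benchmarks/datasets": (25.0, 35.0),
    "evaluation metrics": (33.3, 35.0),
    "rater population": (8.3, 35.0),
    "annotation unit": (16.7, 35.0),
}

bands = {name: coverage_band(cov, tgt) for name, (cov, tgt) in CHECKLIST.items()}

for name, band in bands.items():
    print(f"{band}: {name}")
```

Under this assumed rule, the three Gap rows are the ones to address first when planning a replication.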

Strengths

Strong human-feedback signal (100% of papers).

Known Gaps

Only 8.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.3% coverage).
Annotation unit is under-specified (16.7% coverage).

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (ContentBench vs HLE) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.
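
For the inter-annotator agreement check suggested above, Cohen's kappa is one common choice for two raters assigning categorical labels. The sketch below is a minimal pure-Python version; the rater labels are hypothetical and are not drawn from any paper in this hub, and none of the listed papers is confirmed to use this statistic.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's label rates.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty item set.")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four items from two annotators.
annotator_1 = ["accept", "accept", "reject", "reject"]
annotator_2 = ["accept", "reject", "reject", "reject"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 3 of 4 items (observed 0.75) against a chance agreement of 0.5, which gives kappa = 0.5. For more than two raters or ordinal scores, a different statistic (for example Fleiss' kappa or Krippendorff's alpha) would be more appropriate.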

Recommended Queries

Human Eval Protocols
Benchmark Slice: ContentBench
Metric Slice: accuracy
Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Feb 15, 2026 · Citations: 0 · Score: 9.5
HF: Expert Verification, Critique Edit · Eval: Automatic Metrics · Benchmark: HLE · Metric: Accuracy

Can Large Language Models Replace Human Coders? Introducing ContentBench
Feb 23, 2026 · Citations: 0 · Score: 8.0
HF: Critique Edit · Eval: Automatic Metrics · Benchmark: ContentBench · Metric: Agreement

RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Jan 22, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference, Critique Edit · Eval: Human Eval · Benchmark: Rebuttalbench · Metric: Not reported

From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Feb 14, 2026 · Citations: 0 · Score: 6.0
HF: Critique Edit · Eval: Simulation Env · Benchmark: Not reported · Metric: Latency

CAMEL: Confidence-Gated Reflection for Reward Modeling
Feb 24, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference, Critique Edit · Eval: Automatic Metrics · Benchmark: Not reported · Metric: Accuracy

Unlocking Reasoning Capability on Machine Translation in Large Language Models
Feb 16, 2026 · Citations: 0 · Score: 4.5
HF: Critique Edit · Eval: Not reported · Benchmark: Not reported · Metric: Not reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Date	HF Signal	Eval Modes	Benchmarks	Metrics	QC
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam	Feb 15, 2026	Yes (Expert Verification, Critique Edit)	Automatic Metrics	HLE	Accuracy	Adjudication
Can Large Language Models Replace Human Coders? Introducing ContentBench	Feb 23, 2026	Yes (Critique Edit)	Automatic Metrics	ContentBench	Agreement, Cost	Not reported
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind	Jan 22, 2026	Yes (Pairwise Preference, Critique Edit)	Human Eval	Rebuttalbench	Not reported	Not reported
From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design	Feb 14, 2026	Yes (Critique Edit)	Simulation Env	Not reported	Latency	Not reported
CAMEL: Confidence-Gated Reflection for Reward Modeling	Feb 24, 2026	Yes (Pairwise Preference, Critique Edit)	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
Unlocking Reasoning Capability on Machine Translation in Large Language Models	Feb 16, 2026	Yes (Critique Edit)	Not reported	Not reported	Not reported	Not reported
Towards Better RL Training Data Utilization via Second-Order Rollout	Feb 26, 2026	Yes (Critique Edit)	Not reported	Not reported	Not reported	Not reported
Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information	Feb 25, 2026	Yes (Critique Edit)	Automatic Metrics	Not reported	Not reported	Not reported
Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI	Feb 9, 2026	Yes (Critique Edit)	Automatic Metrics	Not reported	Not reported	Not reported
The logic of KM belief update is contained in the logic of AGM belief revision	Feb 26, 2026	Yes (Critique Edit)	Not reported	Not reported	Not reported	Not reported
Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift	Feb 26, 2026	Yes (Critique Edit)	Not reported	Not reported	Not reported	Not reported
Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition	Feb 16, 2026	Yes (Critique Edit)	Not reported	Not reported	Not reported	Not reported
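
The replication-ready criterion used on this page (benchmark + metric + explicit evaluation mode) can be applied to the matrix rows mechanically. The sketch below is illustrative only: the `PaperProtocol` class and the shortened titles are hypothetical, and only the first six rows are encoded, since the remaining six report no benchmark and so cannot qualify.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PaperProtocol:
    """One matrix row; None or an empty tuple stands for 'Not reported'."""
    title: str  # shortened title, for illustration only
    eval_mode: Optional[str]
    benchmark: Optional[str]
    metrics: Tuple[str, ...] = ()

    def is_replication_ready(self) -> bool:
        # Replication-ready: benchmark, metric, and evaluation mode all present.
        return bool(self.eval_mode and self.benchmark and self.metrics)


# Values copied from the first six rows of the protocol matrix.
PAPERS = [
    PaperProtocol("HLE-Verified", "Automatic Metrics", "HLE", ("Accuracy",)),
    PaperProtocol("ContentBench", "Automatic Metrics", "ContentBench", ("Agreement", "Cost")),
    PaperProtocol("RebuttalAgent", "Human Eval", "Rebuttalbench", ()),
    PaperProtocol("From Pixels to Policies", "Simulation Env", None, ("Latency",)),
    PaperProtocol("CAMEL", "Automatic Metrics", None, ("Accuracy", "Cost")),
    PaperProtocol("Unlocking Reasoning Capability on MT", None, None, ()),
]

ready = [p.title for p in PAPERS if p.is_replication_ready()]
print(ready)
```

This yields the two papers counted as replication-ready on this page: HLE-Verified and ContentBench.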

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	HLE-Verified: A Systematic Verification and Structu…	Can Large Language Models Replace Human Coders? Int…	RebuttalAgent: Strategic Persuasion in Academic Reb…
Human Feedback	Expert Verification, Critique Edit	Critique Edit	Pairwise Preference, Critique Edit
Evaluation Modes	Automatic Metrics	Automatic Metrics	Human Eval
Benchmarks	HLE	ContentBench	Rebuttalbench
Metrics	Accuracy	Agreement, Cost	Not reported
Quality Controls	Adjudication	Not reported	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Unknown	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Critique Edit (12)
Pairwise Preference (2)
Expert Verification (1)

Evaluation Modes

Automatic Metrics (5)
Human Eval (1)
Simulation Env (1)

Top Benchmarks

ContentBench (1)
HLE (1)
Rebuttalbench (1)

Top Metrics

Accuracy (2)
Cost (2)
Agreement (1)
Coherence (1)

Rater Population Mix

Domain Experts (1)

Quality Controls

Adjudication (1)

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 25.0% · metrics 33.3% · quality controls 8.3%.
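
As a sanity check, the coverage percentages can be recomputed from paper counts over the 12 sampled papers. The counts below are an assumption read off the protocol matrix on this page (papers with at least one reported value in the relevant column); the function name `coverage_pct` is hypothetical.

```python
def coverage_pct(count: int, total: int) -> float:
    """Share of papers reporting a signal, as a percentage rounded to one decimal."""
    return round(100.0 * count / total, 1)


TOTAL_PAPERS = 12

# Number of papers reporting each signal, counted from the protocol matrix.
COUNTS = {
    "human-feedback": 12,
    "benchmarks": 3,
    "metrics": 4,
    "quality controls": 1,
}

summary = " · ".join(
    f"{name} {coverage_pct(count, TOTAL_PAPERS):.1f}%" for name, count in COUNTS.items()
)
print(summary)
```

The printed line matches the diagnostics reported above: 100.0%, 25.0%, 33.3%, and 8.3%.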

Top Papers

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0

Expert Verification · Critique Edit · Automatic Metrics

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Zhitao He, Zongwei Lyu, Yi R Fung · Jan 22, 2026 · Citations: 0

Pairwise Preference · Critique Edit · Human Eval

In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion…
From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026 · Citations: 0

Critique Edit · Simulation Env

We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design.
Can Large Language Models Replace Human Coders? Introducing ContentBench
Michael Haman · Feb 23, 2026 · Citations: 0

Critique Edit · Automatic Metrics

This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.
CAMEL: Confidence-Gated Reflection for Reward Modeling
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026 · Citations: 0

Pairwise Preference · Critique Edit · Automatic Metrics

Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances.
Unlocking Reasoning Capability on Machine Translation in Large Language Models
Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio · Feb 16, 2026 · Citations: 0

Critique Edit · Long Horizon

We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
Towards Better RL Training Data Utilization via Second-Order Rollout
Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui · Feb 26, 2026 · Citations: 0

Critique Edit

Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple…
Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information
Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu · Feb 25, 2026 · Citations: 0

Critique Edit · Automatic Metrics

To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer.
Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI
Ziyan Wang, Longlong Ma · Feb 9, 2026 · Citations: 0

Critique Edit · Automatic Metrics

In Chomsky's provocative critique "The False Promise of CHATGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures like humans, there…
The logic of KM belief update is contained in the logic of AGM belief revision
Giacomo Bonanno · Feb 26, 2026 · Citations: 0

Critique Edit

Denoting the latter by $\mathcal{L}_{AGM}$ and the former by $\mathcal{L}_{KM}$ we show that every axiom of $\mathcal{L}_{KM}$ is a theorem of $\mathcal{L}_{AGM}$.
Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift
Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim · Feb 26, 2026 · Citations: 0

Critique Edit

NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition
Varun Nathan, Shreyas Guha, Ayush Kumar · Feb 16, 2026 · Citations: 0

Critique Edit

We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools…
