HFEPX Hub

General + Demonstrations (Last 60 Days)

Updated from current HFEPX corpus (Mar 8, 2026). 10 papers are grouped in this hub page.


Common evaluation mode: automatic metrics. Most common rater population: domain experts. Common annotation unit: multi-dimensional rubric. Frequently cited benchmark: Auditbench. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 5, 2026.

Papers: 10 · Last published: Mar 5, 2026

General · Demonstrations · Last 60d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

All Sampled Papers (10) · Replication-Ready Only (0)

High-Signal Coverage

100.0%

10 of 10 sampled papers are not flagged as low-signal.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

0 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).
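
The three counts above can be reproduced with simple filters over per-paper protocol fields. The following is a minimal sketch, not HFEPX's actual implementation: the `Paper` record, its field names, and the helper functions are hypothetical, and the field values are transcribed from this hub's protocol matrix. Only the mode labels `human_eval` and `llm_as_judge` come from the page itself.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical per-paper record; field names are illustrative only."""
    title: str
    eval_modes: set[str] = field(default_factory=set)
    benchmarks: set[str] = field(default_factory=set)
    metrics: set[str] = field(default_factory=set)
    quality_controls: set[str] = field(default_factory=set)


def is_replication_ready(p: Paper) -> bool:
    # Benchmark + metric + explicit evaluation mode must all be present.
    return bool(p.benchmarks) and bool(p.metrics) and bool(p.eval_modes)


def supports_judge_vs_human(p: Paper) -> bool:
    # Both modes must be reported for an agreement analysis.
    return {"human_eval", "llm_as_judge"} <= p.eval_modes


def reports_quality_controls(p: Paper) -> bool:
    # Calibration, adjudication, or inter-annotator agreement (IAA).
    return bool(p.quality_controls)


# Values transcribed from the protocol matrix; titles are shortened.
papers = [
    Paper("AuditBench", benchmarks={"Auditbench"}),
    Paper("FewMMBench", benchmarks={"Fewmmbench"}),
    Paper(
        "Orchestration-Free Customer Service Automation",
        eval_modes={"automatic_metrics"},
        metrics={"cost"},
    ),
    Paper("TimeWarp"),
    Paper("Optimizing In-Context Demonstrations"),
    Paper("Risk-Aware World Model Predictive Control"),
    Paper("ArgLLM-App"),
    Paper("Fine-Tuning Without Forgetting In-Context Learning"),
    Paper("Explore-on-Graph"),
    Paper("Perspectives"),
]

replication_ready = [p for p in papers if is_replication_ready(p)]
judge_comparable = [p for p in papers if supports_judge_vs_human(p)]
with_qc = [p for p in papers if reports_quality_controls(p)]

print(len(replication_ready), len(judge_comparable), len(with_qc))
```

No paper in the sample reports a benchmark, a metric, and an evaluation mode together, which is why the replication-ready set is empty even though each field appears somewhere in the hub.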

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by demonstration data.
Automatic metrics appear in 10% of papers in this hub (1/10).
Auditbench is the highlighted benchmark anchor for cross-paper comparisons on this page, although it appears in only 1 of 10 papers.

Protocol Takeaways

Quality-control reporting is absent in this slice (0 of 10 papers); prioritize papers with explicit calibration or adjudication steps.
Where the rater population is reported (2 of 10 papers), raters are domain experts, and the annotation unit, where known, is a multi-dimensional rubric; use this to scope replication staffing.
Stratify by benchmark (Auditbench vs Fewmmbench) before comparing methods.

Benchmark Interpretation

Auditbench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
Fewmmbench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Cost is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.

Researcher Checklist

Strong: Papers with explicit human feedback
Coverage is strong (100% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (20% vs 35% target).

Gap: Papers naming evaluation metrics
Coverage is a replication risk (10% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (20% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (20% vs 35% target).
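
The checklist labels can be read as a comparison of observed coverage against a target. The sketch below uses the percentages listed in this checklist; the rule that meeting or exceeding the target counts as "Strong" is an assumption inferred from the labels, and the names `CHECKLIST` and `status` are hypothetical.

```python
# (coverage %, target %) pairs as listed in the checklist.
CHECKLIST = {
    "explicit human feedback": (100, 45),
    "quality controls": (0, 30),
    "benchmarks/datasets": (20, 35),
    "evaluation metrics": (10, 35),
    "known rater population": (20, 35),
    "known annotation unit": (20, 35),
}


def status(coverage_pct: float, target_pct: float) -> str:
    # Assumed rule: coverage at or above the target is "Strong", else "Gap".
    return "Strong" if coverage_pct >= target_pct else "Gap"


statuses = {name: status(cov, tgt) for name, (cov, tgt) in CHECKLIST.items()}

# Rank the gaps by shortfall (target minus coverage), largest first.
gaps = sorted(
    (name for name, s in statuses.items() if s == "Gap"),
    key=lambda name: CHECKLIST[name][1] - CHECKLIST[name][0],
    reverse=True,
)

for name in gaps:
    cov, tgt = CHECKLIST[name]
    print(f"Gap: {name} ({cov}% vs {tgt}% target, shortfall {tgt - cov} points)")
```

Ranking by shortfall is one way to decide which missing protocol field to look for first when collecting additional papers.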

Strengths

Strong human-feedback signal (100% of papers).

Known Gaps

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (20% coverage).
Annotation unit is under-specified (20% coverage).

Suggested Next Analyses

Stratify by benchmark (Auditbench vs Fewmmbench) before comparing methods.

Recommended Queries

Benchmark Slice: Auditbench · Metric Slice: cost · Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AuditBench: Evaluating Alignment Auditing Techniques on Models with H…

Highest protocol score with explicit human/eval signal plus Auditbench.

Strongest benchmark reference

FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Fewmmbench gives a fast comparison anchor.

Strongest recent paper

Orchestration-Free Customer Service Automation: A Privacy-Preserving…

Useful for current practice scanning; published Feb 17, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Feb 26, 2026 · Citations: 0 · Score: 6.5
HF: Demonstrations · Eval: Not reported · Benchmark: Auditbench · Metric: Not reported

FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Feb 25, 2026 · Citations: 0 · Score: 6.5
HF: Demonstrations · Eval: Not reported · Benchmark: Fewmmbench · Metric: Not reported

Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Feb 17, 2026 · Citations: 0 · Score: 6.0
HF: Demonstrations · Eval: Automatic Metrics · Benchmark: Not reported · Metric: Cost

TimeWarp: Evaluating Web Agents by Revisiting the Past
Mar 5, 2026 · Citations: 0 · Score: 4.5
HF: Demonstrations · Eval: Not reported · Benchmark: Not reported · Metric: Not reported

Optimizing In-Context Demonstrations for LLM-based Automated Grading
Feb 28, 2026 · Citations: 0 · Score: 4.5
HF: Rubric Rating, Demonstrations · Eval: Not reported · Benchmark: Not reported · Metric: Not reported

Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
Feb 26, 2026 · Citations: 0 · Score: 4.5
HF: Demonstrations · Eval: Not reported · Benchmark: Not reported · Metric: Not reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Date	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors	Feb 26, 2026	Yes (Demonstrations)	Not Reported	Auditbench	Not Reported	Not Reported
FewMMBench: A Benchmark for Multimodal Few-Shot Learning	Feb 25, 2026	Yes (Demonstrations)	Not Reported	Fewmmbench	Not Reported	Not Reported
Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework	Feb 17, 2026	Yes (Demonstrations)	Automatic Metrics	Not Reported	Cost	Not Reported
TimeWarp: Evaluating Web Agents by Revisiting the Past	Mar 5, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
Optimizing In-Context Demonstrations for LLM-based Automated Grading	Feb 28, 2026	Yes (Rubric Rating, Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving	Feb 26, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models	Feb 27, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models	Feb 26, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling	Feb 25, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite	Feb 17, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AuditBench: Evaluating Alignment Auditing Technique…	FewMMBench: A Benchmark for Multimodal Few-Shot Lea…	Orchestration-Free Customer Service Automation: A P…
Human Feedback	Demonstrations	Demonstrations	Demonstrations
Evaluation Modes	Not reported	Not reported	Automatic Metrics
Benchmarks	Auditbench	Fewmmbench	Not reported
Metrics	Not reported	Not reported	Cost
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Unknown	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Demonstrations (10)
Rubric Rating (1)

Evaluation Modes

Automatic Metrics (1)

Top Benchmarks

Auditbench (1)
Fewmmbench (1)

Top Metrics

Cost (1)

Rater Population Mix

Domain Experts (2)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 20.0% · metrics 10.0% · quality controls 0.0%.
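
Each coverage figure is the share of the 10 sampled papers that report a given field. A minimal sketch of that arithmetic follows, with paper counts taken from the snapshot lists and protocol matrix on this page; the names `PAPERS_WITH_FIELD` and `coverage_line` are hypothetical, and the one-decimal formatting simply mirrors the line above.

```python
TOTAL_PAPERS = 10

# Number of sampled papers reporting each field (from this page's lists).
PAPERS_WITH_FIELD = {
    "human-feedback": 10,   # every paper carries a demonstration signal
    "benchmarks": 2,        # Auditbench (1) + Fewmmbench (1)
    "metrics": 1,           # cost (1)
    "quality controls": 0,  # none reported
}


def coverage_line(counts: dict[str, int], total: int) -> str:
    """Format per-field coverage as percentages with one decimal place."""
    parts = [f"{name} {100 * n / total:.1f}%" for name, n in counts.items()]
    return " · ".join(parts)


line = coverage_line(PAPERS_WITH_FIELD, TOTAL_PAPERS)
print(line)
```

With only 10 papers in the sample, each paper moves a coverage figure by 10 percentage points, so these values are best treated as rough indicators.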

Top Papers

TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0
Demonstrations · Web Browsing
The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?

Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026 · Citations: 0
Demonstrations · Automatic Metrics
Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.

Optimizing In-Context Demonstrations for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek · Feb 28, 2026 · Citations: 0
Rubric Rating · Demonstrations
GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman · Feb 26, 2026 · Citations: 0
Demonstrations
We introduce AuditBench, an alignment auditing benchmark.

FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0
Demonstrations
In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.

Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu · Feb 26, 2026 · Citations: 0
Demonstrations
Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation.

ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Adam Dejl, Deniz Gorur, Francesca Toni · Feb 27, 2026 · Citations: 0
Demonstrations
Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by…

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Chungpa Lee, Jy-yong Sohn, Kangwook Lee · Feb 26, 2026 · Citations: 0
Demonstrations
We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning.

Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen · Feb 25, 2026 · Citations: 0
Demonstrations
Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.

Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
Tim Fischer, Chris Biemann · Feb 17, 2026 · Citations: 0
Demonstrations
This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
