HFEPX Hub

Demonstrations Papers (Last 60 Days)

Updated from current HFEPX corpus (Mar 8, 2026). 14 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: Auditbench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 5, 2026.

Papers: 14 · Last published: Mar 5, 2026


Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%

14 / 14 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

0 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.
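
The replication-ready criterion above (benchmark + metric + explicit evaluation mode) can be sketched as a simple filter. This is a minimal illustration, not the hub's actual scoring code: the `Paper` record and `is_replication_ready` helper are hypothetical names, the four rows are transcribed from the protocol fields this page lists for its top-ranked papers, and `None` stands in for "Not Reported".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Paper:
    """Hypothetical record; None means the field is 'Not Reported'."""
    title: str
    eval_mode: Optional[str]
    benchmark: Optional[str]
    metric: Optional[str]


def is_replication_ready(paper: Paper) -> bool:
    # All three protocol ingredients must be explicitly present.
    fields = (paper.benchmark, paper.metric, paper.eval_mode)
    return all(value is not None for value in fields)


# Rows transcribed from this page's listing of top-ranked papers.
papers = [
    Paper("AuditBench", None, "Auditbench", None),
    Paper("FewMMBench", None, "Fewmmbench", None),
    Paper("IDP Accelerator", "Automatic Metrics", None, "Accuracy"),
    Paper("Orchestration-Free Customer Service Automation",
          "Automatic Metrics", None, "Cost"),
]

ready = [p.title for p in papers if is_replication_ready(p)]
print(ready)  # prints []
```

Under these assumptions no row passes: the two benchmark papers lack a metric and evaluation mode, and the two papers with automatic metrics lack a named benchmark, which is consistent with the count of 0 reported above.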

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by demonstration data.
Automatic metrics appear in 14.3% of papers in this hub.
Auditbench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Where a rater population is reported, it is domain experts (2 papers), and annotation is commonly at the trajectory level; use this to scope replication staffing.
Stratify by benchmark (Auditbench vs Fewmmbench) before comparing methods.

Benchmark Interpretation

Auditbench appears in 7.1% of hub papers (1/14); use this cohort for benchmark-matched comparisons.
Fewmmbench appears in 7.1% of hub papers (1/14); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 7.1% of hub papers (1/14); compare with a secondary metric before ranking methods.
Cost is reported in 7.1% of hub papers (1/14); compare with a secondary metric before ranking methods.

Researcher Checklist

Strong: Papers with explicit human feedback
Coverage is strong (100% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (14.3% vs 35% target).

Gap: Papers naming evaluation metrics
Coverage is a replication risk (14.3% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (14.3% vs 35% target).

Moderate: Papers with known annotation unit
Coverage is usable but incomplete (21.4% vs 35% target).
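
The checklist compares each coverage figure against a target and labels it Strong, Moderate, or Gap. The page does not state the rule behind those labels, so the sketch below is an assumption: `coverage_band` is a hypothetical helper, and the `moderate_ratio` cutoff of 0.5 (half the target) was chosen only because it reproduces the six labels shown in the checklist.

```python
def coverage_band(coverage_pct: float, target_pct: float,
                  moderate_ratio: float = 0.5) -> str:
    """Label coverage against a target.

    ASSUMPTION: the 0.5 cutoff is not documented by the hub; it is
    picked here because it matches the labels in the checklist.
    """
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (coverage %, target %) pairs copied from the checklist.
checks = {
    "explicit human feedback": (100.0, 45.0),
    "quality controls": (0.0, 30.0),
    "benchmarks/datasets": (14.3, 35.0),
    "evaluation metrics": (14.3, 35.0),
    "rater population": (14.3, 35.0),
    "annotation unit": (21.4, 35.0),
}

for name, (coverage, target) in checks.items():
    print(f"{coverage_band(coverage, target)}: {name} "
          f"({coverage}% vs {target}% target)")
```

With a 35% target the assumed Moderate cutoff is 17.5%, so 21.4% lands in Moderate and 14.3% in Gap. A different cutoff could fit the same labels, so treat the value as illustrative.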

Strengths

Strong human-feedback signal (100% of papers).

Known Gaps

No papers report quality controls (0%); prioritize calibration/adjudication evidence.
Rater population is under-specified (14.3% coverage).
Annotation unit is under-specified (21.4% coverage).

Suggested Next Analyses

Stratify by benchmark (Auditbench vs Fewmmbench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
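
Stratifying by benchmark before comparing methods can be sketched as a group-by over paper records. The record layout and the `stratify_by_benchmark` helper are hypothetical; the benchmark values for the four sample rows are taken from this page, with `None` standing in for "Not Reported".

```python
from collections import defaultdict

# Sample rows; benchmark values as listed on this page.
records = [
    {"title": "AuditBench", "benchmark": "Auditbench"},
    {"title": "FewMMBench", "benchmark": "Fewmmbench"},
    {"title": "IDP Accelerator", "benchmark": None},
    {"title": "RoboPocket", "benchmark": None},
]


def stratify_by_benchmark(rows):
    """Group paper titles by benchmark; unreported benchmarks share one stratum."""
    strata = defaultdict(list)
    for row in rows:
        key = row["benchmark"] or "Not Reported"
        strata[key].append(row["title"])
    return dict(strata)


strata = stratify_by_benchmark(records)
for benchmark, titles in strata.items():
    print(f"{benchmark}: {titles}")
```

Methods would then be compared only within a stratum. In this hub each named benchmark currently holds a single paper, so benchmark-matched comparison needs more papers before it becomes informative.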

Recommended Queries

Benchmark Slice: Auditbench
Metric Slice: accuracy
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AuditBench: Evaluating Alignment Auditing Techniques on Models with H…

Highest protocol score with explicit human/eval signal plus Auditbench.

Strongest benchmark reference

FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Fewmmbench gives a fast comparison anchor.

Strongest recent paper

IDP Accelerator: Agentic Document Intelligence from Extraction to Com…

Useful for current practice scanning; published Feb 26, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Feb 26, 2026 · Citations: 0 · Score: 6.5
HF: Demonstrations · Eval: Not Reported · Benchmark: Auditbench · Metric: Not Reported

FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Feb 25, 2026 · Citations: 0 · Score: 6.5
HF: Demonstrations · Eval: Not Reported · Benchmark: Fewmmbench · Metric: Not Reported

IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation
Feb 26, 2026 · Citations: 0 · Score: 6.0
HF: Demonstrations · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Feb 17, 2026 · Citations: 0 · Score: 6.0
HF: Demonstrations · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Cost

RoboPocket: Improve Robot Policies Instantly with Your Phone
Mar 5, 2026 · Citations: 0 · Score: 4.5
HF: Demonstrations · Eval: Not Reported · Benchmark: Not Reported · Metric: Not Reported

TimeWarp: Evaluating Web Agents by Revisiting the Past
Mar 5, 2026 · Citations: 0 · Score: 4.5
HF: Demonstrations · Eval: Not Reported · Benchmark: Not Reported · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Date	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors	Feb 26, 2026	Yes: Demonstrations	Not Reported	Auditbench	Not Reported	Not Reported
FewMMBench: A Benchmark for Multimodal Few-Shot Learning	Feb 25, 2026	Yes: Demonstrations	Not Reported	Fewmmbench	Not Reported	Not Reported
IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation	Feb 26, 2026	Yes: Demonstrations	Automatic Metrics	Not Reported	Accuracy, Latency	Not Reported
Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework	Feb 17, 2026	Yes: Demonstrations	Automatic Metrics	Not Reported	Cost	Not Reported
RoboPocket: Improve Robot Policies Instantly with Your Phone	Mar 5, 2026	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
TimeWarp: Evaluating Web Agents by Revisiting the Past	Mar 5, 2026	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
Optimizing In-Context Demonstrations for LLM-based Automated Grading	Feb 28, 2026	Yes: Rubric Rating, Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving	Feb 26, 2026	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences	Feb 19, 2026	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models	Feb 27, 2026	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models	Feb 26, 2026	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling	Feb 25, 2026	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AuditBench: Evaluating Alignment Auditing Technique…	FewMMBench: A Benchmark for Multimodal Few-Shot Lea…	IDP Accelerator: Agentic Document Intelligence from…
Human Feedback	Demonstrations	Demonstrations	Demonstrations
Evaluation Modes	Not reported	Not reported	Automatic Metrics
Benchmarks	Auditbench	Fewmmbench	Not reported
Metrics	Not reported	Not reported	Accuracy, Latency
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Unknown	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Demonstrations (14)
Rubric Rating (1)

Evaluation Modes

Automatic Metrics (2)

Top Benchmarks

Auditbench (1)
Fewmmbench (1)

Top Metrics

Accuracy (1)
Cost (1)
Latency (1)

Rater Population Mix

Domain Experts (2)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 14.3% · metrics 14.3% · quality controls 0.0%.
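
The coverage figures above are simple shares of the 14 sampled papers. The sketch below recomputes them from paper counts; `coverage_pct` is a hypothetical helper, and the counts are read off this page (2 papers name a benchmark; 2 papers report at least one metric, since Accuracy and Latency come from the same paper).

```python
def coverage_pct(n_with_field: int, n_total: int) -> float:
    """Share of papers reporting a field, as a percentage with one decimal."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return round(100.0 * n_with_field / n_total, 1)


N_PAPERS = 14

# Papers (not mentions) reporting each field, as listed on this page.
counts = {
    "human-feedback": 14,
    "benchmarks": 2,
    "metrics": 2,
    "quality controls": 0,
}

summary = " · ".join(
    f"{name} {coverage_pct(count, N_PAPERS):.1f}%" for name, count in counts.items()
)
print(summary)
# prints: human-feedback 100.0% · benchmarks 14.3% · metrics 14.3% · quality controls 0.0%
```

The same helper gives 7.1% for a field reported by a single paper (1/14) and 21.4% for three papers (3/14), matching the per-benchmark, per-metric, and annotation-unit figures elsewhere on the page.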

Top Papers

RoboPocket: Improve Robot Policies Instantly with Your Phone
Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le · Mar 5, 2026 · Citations: 0
Tags: Demonstrations · Long Horizon
To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones.

TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0
Tags: Demonstrations · Web Browsing
The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?

IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation
Md Mofijul Islam, Md Sirajus Salekin, Joe King, Priyashree Roy, Vamsi Thilak Gudi · Feb 26, 2026 · Citations: 0
Tags: Demonstrations · Automatic Metrics
We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging…

Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026 · Citations: 0
Tags: Demonstrations · Automatic Metrics
Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.

Optimizing In-Context Demonstrations for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek · Feb 28, 2026 · Citations: 0
Tags: Rubric Rating · Demonstrations
GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman · Feb 26, 2026 · Citations: 0
Tags: Demonstrations
We introduce AuditBench, an alignment auditing benchmark.

FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0
Tags: Demonstrations
In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.

Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu · Feb 26, 2026 · Citations: 0
Tags: Demonstrations
Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation.

From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences
Yi-Chih Huang · Feb 19, 2026 · Citations: 0
Tags: Demonstrations
Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences.

ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Adam Dejl, Deniz Gorur, Francesca Toni · Feb 27, 2026 · Citations: 0
Tags: Demonstrations
Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by…

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Chungpa Lee, Jy-yong Sohn, Kangwook Lee · Feb 26, 2026 · Citations: 0
Tags: Demonstrations
We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning.

Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen · Feb 25, 2026 · Citations: 0
Tags: Demonstrations
Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.

Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
Rakshit Trivedi, Kartik Sharma, David C Parkes · Feb 24, 2026 · Citations: 0
Tags: Demonstrations
Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts.

Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
Tim Fischer, Chris Biemann · Feb 17, 2026 · Citations: 0
Tags: Demonstrations
This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
