HFEPX Hub

General + Demonstrations Papers

Updated from current HFEPX corpus (Mar 8, 2026). 20 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: multi-dimensional rubric. Frequently cited benchmark: Auditbench. Common metric signal: win rate. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper listed in this set is from Mar 5, 2026.

Papers: 20 · Last published: Mar 5, 2026

General · Demonstrations

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage: 100.0% (20 / 20 sampled papers are not low-signal flagged).
Replication-Ready Set: 0 papers (benchmark + metric + explicit evaluation mode all present).
Judge/Human Comparability: 0 papers (containing both `human_eval` and `llm_as_judge`, which judge-vs-human agreement analysis requires).
Quality Controls: 0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.
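
The replication-ready criterion above (benchmark + metric + explicit evaluation mode) can be read as a simple filter. The following is a minimal sketch, not the hub's actual implementation: the `Paper` record, its field names, and the helper `is_replication_ready` are hypothetical, titles are shortened, and the sample rows are copied from the Protocol Matrix on this page, with "Not Reported" represented as an empty tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; field names are assumptions, not the hub's schema.
    title: str
    eval_modes: tuple = ()
    benchmarks: tuple = ()
    metrics: tuple = ()


def is_replication_ready(paper: Paper) -> bool:
    """True only when benchmark, metric, and evaluation mode are all reported."""
    return bool(paper.benchmarks) and bool(paper.metrics) and bool(paper.eval_modes)


# Rows taken from the Protocol Matrix; empty tuples stand for "Not Reported".
papers = [
    Paper("AuditBench", benchmarks=("Auditbench",)),
    Paper("FewMMBench", benchmarks=("Fewmmbench",)),
    Paper("Orchestration-Free Customer Service Automation",
          eval_modes=("Automatic Metrics",), metrics=("Cost",)),
    Paper("VolleyBots",
          eval_modes=("Automatic Metrics", "Simulation Env"), metrics=("Win rate",)),
]

ready = [p.title for p in papers if is_replication_ready(p)]
print(ready)
```

Each sampled row is missing at least one of the three ingredients, so the filter returns an empty list, which matches the count of 0 replication-ready papers reported above.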

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by demonstration data.
Automatic metrics appear in 25% of papers in this hub (5/20).
Auditbench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Where a rater population is reported (25% of papers), it is domain experts, and the most common annotation unit among the 15% that specify one is a multi-dimensional rubric; use this to scope replication staffing.
Stratify by benchmark (Auditbench vs Fewmmbench) before comparing methods.

Benchmark Interpretation

Auditbench appears in 5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.
Fewmmbench appears in 5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Win rate is reported in 10% of hub papers (2/20); compare with a secondary metric before ranking methods.
Cost is reported in 5% of hub papers (1/20); compare with a secondary metric before ranking methods.

Researcher Checklist

Strong: Papers with explicit human feedback. Coverage is strong (100% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets. Coverage is a replication risk (10% vs 35% target).
Gap: Papers naming evaluation metrics. Coverage is a replication risk (20% vs 35% target).
Moderate: Papers with known rater population. Coverage is usable but incomplete (25% vs 35% target).
Gap: Papers with known annotation unit. Coverage is a replication risk (15% vs 35% target).
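
The checklist labels compare observed coverage against a target. The page does not state the exact rule, so the sketch below is an assumption that happens to reproduce all six labels: "Strong" when coverage meets the target, "Moderate" when it reaches at least two-thirds of the target, and "Gap" otherwise. The function name `coverage_band` and the `moderate_ratio` parameter are hypothetical.

```python
def coverage_band(coverage_pct: float, target_pct: float,
                  moderate_ratio: float = 2 / 3) -> str:
    """Label coverage relative to a target.

    The 2/3 cutoff between Moderate and Gap is an assumption; the page only
    shows that 25% vs 35% is Moderate and 20% vs 35% is a Gap.
    """
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (coverage %, target %) pairs copied from the checklist above.
checklist = {
    "explicit human feedback": (100, 45),
    "quality controls": (0, 30),
    "benchmarks/datasets": (10, 35),
    "evaluation metrics": (20, 35),
    "known rater population": (25, 35),
    "known annotation unit": (15, 35),
}

bands = {name: coverage_band(cov, target) for name, (cov, target) in checklist.items()}
for name, band in bands.items():
    print(f"{band}: {name}")
```

Any cutoff ratio between roughly 0.58 and 0.71 of the target would give the same labels for these six rows, so the 2/3 value should be treated as illustrative only.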

Strengths

Strong human-feedback signal (100% of papers).
Agentic evaluation appears in 30% of papers.

Known Gaps

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Annotation unit is under-specified (15% coverage).
Benchmark coverage is thin (10% of papers mention benchmarks/datasets).

Suggested Next Analyses

Stratify by benchmark (Auditbench vs Fewmmbench) before comparing methods.
Track metric sensitivity by reporting both win rate and cost.
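
As a rough illustration of the stratification step, the sketch below groups papers by benchmark and checks which strata contain enough papers to compare, and which papers report both win rate and cost. It is a sketch under stated assumptions: the row layout and the helper `stratify_by_benchmark` are hypothetical, titles are shortened, and the rows are copied from the Protocol Matrix on this page (`None` stands for "Not Reported").

```python
from collections import defaultdict

# (short title, benchmark or None, reported metrics), from the Protocol Matrix.
rows = [
    ("AuditBench", "Auditbench", []),
    ("FewMMBench", "Fewmmbench", []),
    ("Orchestration-Free Customer Service Automation", None, ["Cost"]),
    ("Mastering Multi-Drone Volleyball", None, ["Win rate"]),
    ("VolleyBots", None, ["Win rate"]),
]


def stratify_by_benchmark(rows):
    """Group papers by benchmark; papers without one share a 'Not Reported' stratum."""
    strata = defaultdict(list)
    for title, benchmark, metrics in rows:
        strata[benchmark or "Not Reported"].append((title, metrics))
    return dict(strata)


strata = stratify_by_benchmark(rows)

# A benchmark-matched comparison needs at least two papers on the same benchmark.
comparable = {
    bench: members
    for bench, members in strata.items()
    if bench != "Not Reported" and len(members) >= 2
}

# Papers that report both metrics suggested for sensitivity tracking.
both_metrics = [title for title, _, metrics in rows
                if {"Win rate", "Cost"} <= set(metrics)]

print(sorted(strata), comparable, both_metrics)
```

With the rows listed on this page, each named benchmark has a single paper and no paper reports both win rate and cost, so both `comparable` and `both_metrics` come back empty. That is consistent with the hub's advice to collect additional papers before attempting replication-critical comparisons.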

Recommended Queries

Benchmark Slice: Auditbench
Metric Slice: win rate
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AuditBench: Evaluating Alignment Auditing Techniques on Models with H…

Highest protocol score with explicit human/eval signal plus Auditbench.

Strongest benchmark reference

FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Fewmmbench gives a fast comparison anchor.

Strongest recent paper

Orchestration-Free Customer Service Automation: A Privacy-Preserving…

Useful for current practice scanning; published Feb 17, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Feb 26, 2026 · Citations: 0 · Score: 6.5
HF: Demonstrations · Eval: Not Reported · Benchmark: Auditbench · Metric: Not Reported

FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Feb 25, 2026 · Citations: 0 · Score: 6.5
HF: Demonstrations · Eval: Not Reported · Benchmark: Fewmmbench · Metric: Not Reported

Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Feb 17, 2026 · Citations: 0 · Score: 6.0
HF: Demonstrations · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Cost

Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning
May 7, 2025 · Citations: 0 · Score: 4.5
HF: Demonstrations · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Win rate

TimeWarp: Evaluating Web Agents by Revisiting the Past
Mar 5, 2026 · Citations: 0 · Score: 4.5
HF: Demonstrations · Eval: Not Reported · Benchmark: Not Reported · Metric: Not Reported

Optimizing In-Context Demonstrations for LLM-based Automated Grading
Feb 28, 2026 · Citations: 0 · Score: 4.5
HF: Rubric Rating, Demonstrations · Eval: Not Reported · Benchmark: Not Reported · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Date	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors	Feb 26, 2026	Yes (Demonstrations)	Not Reported	Auditbench	Not Reported	Not Reported
FewMMBench: A Benchmark for Multimodal Few-Shot Learning	Feb 25, 2026	Yes (Demonstrations)	Not Reported	Fewmmbench	Not Reported	Not Reported
Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework	Feb 17, 2026	Yes (Demonstrations)	Automatic Metrics	Not Reported	Cost	Not Reported
Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning	May 7, 2025	Yes (Demonstrations)	Automatic Metrics	Not Reported	Win rate	Not Reported
TimeWarp: Evaluating Web Agents by Revisiting the Past	Mar 5, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
Optimizing In-Context Demonstrations for LLM-based Automated Grading	Feb 28, 2026	Yes (Rubric Rating, Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving	Feb 26, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models	Feb 27, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models	Feb 26, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling	Feb 25, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite	Feb 17, 2026	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play	Feb 4, 2025	Yes (Demonstrations)	Automatic Metrics, Simulation Env	Not Reported	Win rate	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AuditBench: Evaluating Alignment Auditing Technique…	FewMMBench: A Benchmark for Multimodal Few-Shot Lea…	Orchestration-Free Customer Service Automation: A P…
Human Feedback	Demonstrations	Demonstrations	Demonstrations
Evaluation Modes	Not reported	Not reported	Automatic Metrics
Benchmarks	Auditbench	Fewmmbench	Not reported
Metrics	Not reported	Not reported	Cost
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Unknown	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Demonstrations (20)
Pairwise Preference (2)
Rubric Rating (1)

Evaluation Modes

Automatic Metrics (5)
Simulation Env (3)

Top Benchmarks

Auditbench (1)
Fewmmbench (1)

Top Metrics

Win rate (2)
Cost (1)
Success rate (1)
Task success (1)

Rater Population Mix

Domain Experts (5)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 10.0% · metrics 20.0% · quality controls 0.0%.

Top Papers

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
Zelai Xu, Ruize Zhang, Chao Yu, Huining Yuan, Xiangmin Yi · Feb 4, 2025 · Citations: 0

Demonstrations · Automatic Metrics · Simulation Env · Multi Agent

We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement…
Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning
Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji · May 7, 2025 · Citations: 0

Demonstrations · Automatic Metrics · Long Horizon

The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0

Demonstrations · Simulation Env · Long Horizon

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025 · Citations: 0

Demonstrations · Simulation Env · Multi Agent

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou · Jan 28, 2025 · Citations: 0

Pairwise Preference · Demonstrations · Automatic Metrics · Web Browsing

We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency.
TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0

Demonstrations · Web Browsing

The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026 · Citations: 0

Demonstrations · Automatic Metrics

Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
Optimizing In-Context Demonstrations for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek · Feb 28, 2026 · Citations: 0

Rubric Rating · Demonstrations

GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
Learning to Answer from Correct Demonstrations
Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma · Oct 17, 2025 · Citations: 0

Demonstrations · Automatic Metrics

We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time.
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman · Feb 26, 2026 · Citations: 0

Demonstrations

We introduce AuditBench, an alignment auditing benchmark.
FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0

Demonstrations

In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu · Feb 26, 2026 · Citations: 0

Demonstrations

Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation.
Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment
Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu · Mar 23, 2025 · Citations: 0

Pairwise Preference · Demonstrations

Existing techniques are either reward-based (training a reward model on preference pairs and optimizing with reinforcement learning) or reward-free (directly fine-tuning on ranked outputs).
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Adam Dejl, Deniz Gorur, Francesca Toni · Feb 27, 2026 · Citations: 0

Demonstrations

Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by…
Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Chungpa Lee, Jy-yong Sohn, Kangwook Lee · Feb 26, 2026 · Citations: 0

Demonstrations

We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning.
Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen · Feb 25, 2026 · Citations: 0

Demonstrations

Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.
Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
Tim Fischer, Chris Biemann · Feb 17, 2026 · Citations: 0

Demonstrations

This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
AITutor-EvalKit: Exploring the Capabilities of AI Tutors
Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025 · Citations: 0

Demonstrations

We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization.
Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers
Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova · Sep 26, 2025 · Citations: 0

Demonstrations

The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning.
REFLEX: Metacognitive Reasoning for Reflective Zero-Shot Robotic Planning with Large Language Models
Wenjie Lin, Jin Wei-Kocsis, Jiansong Zhang, Byung-Cheol Min, Dongming Gan · May 20, 2025 · Citations: 0

Demonstrations

Inspired by human metacognitive learning and creative problem-solving, we address this limitation by exploring a fundamental question: Can LLMs be empowered with metacognitive capabilities to reason, reflect, and create, thereby enhancing…
