HFEPX Hub

Demonstrations Papers (Last 45 Days)

Updated from current HFEPX corpus (Apr 19, 2026). 13 papers are grouped in this hub page.

Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: Mapg-Bench. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 22, 2026.

Papers: 13 · Last published: Mar 22, 2026

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%: 13 / 13 sampled papers are not flagged as low-signal.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present: 1 paper.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`: 1 paper, which supports judge-vs-human agreement analysis.

0 papers report explicit quality controls (calibration/adjudication/IAA).
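
The two filters above (replication-ready, and judge/human comparable) can be sketched as simple predicates over per-paper records. This is a minimal illustration, not the hub's actual pipeline: the `Paper` record, its field names, and tags such as `simulation_env` are assumptions; only `human_eval` and `llm_as_judge` appear on this page. The sample rows are taken from the protocol matrix further down.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical record shape; field names are illustrative assumptions."""
    title: str
    eval_modes: set[str] = field(default_factory=set)
    benchmarks: set[str] = field(default_factory=set)
    metrics: set[str] = field(default_factory=set)


def is_replication_ready(paper: Paper) -> bool:
    # Replication-ready: benchmark, metric, and eval mode are all present.
    return bool(paper.benchmarks and paper.metrics and paper.eval_modes)


def is_judge_human_comparable(paper: Paper) -> bool:
    # Comparable: the paper reports both human eval and LLM-as-judge.
    return {"human_eval", "llm_as_judge"} <= paper.eval_modes


papers = [
    Paper("AgentHER", {"human_eval", "llm_as_judge"},
          {"WebArena", "ToolBench"}, {"precision", "pass@1"}),
    Paper("Meanings and Measurements", {"simulation_env"}, {"Mapg-Bench"}),
    Paper("Reason and Verify", {"automatic_metrics"}, set(),
          {"accuracy", "faithfulness"}),
]

ready = [p.title for p in papers if is_replication_ready(p)]
comparable = [p.title for p in papers if is_judge_human_comparable(p)]
print(ready, comparable)
```

On these three sample rows, only the AgentHER record passes both filters, which is consistent with the counts of 1 reported above.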

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by demonstration data.
Simulation environments appear in 23.1% of papers in this hub.
Mapg-Bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

1 sampled paper reports both human evaluation and LLM-as-judge, supporting direct agreement checks.
Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Where reported, raters are mostly domain experts and annotation is commonly at the trajectory level; use this to scope replication staffing.

Benchmark Interpretation

Mapg-Bench appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.
ToolBench appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.

Metric Interpretation

cost is reported in 15.4% of hub papers (2/13); compare with a secondary metric before ranking methods.
precision is reported in 15.4% of hub papers (2/13); compare with a secondary metric before ranking methods.

Researcher Checklist

Strong: Papers with explicit human feedback
Coverage is strong (100% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (15.4% vs 35% target).

Moderate: Papers naming evaluation metrics
Coverage is usable but incomplete (30.8% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (15.4% vs 35% target).

Strong: Papers with known annotation unit
Coverage is strong (46.2% vs 35% target).
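
The Strong / Moderate / Gap labels in the checklist compare each coverage figure with its target. The page does not state the exact cutoffs, so the sketch below assumes a hypothetical rule: Strong at or above target, Moderate at 80% of target or more, Gap otherwise. The `coverage_band` function and the `moderate_ratio` parameter are illustrative names, not part of HFEPX.

```python
def coverage_band(coverage_pct: float, target_pct: float,
                  moderate_ratio: float = 0.8) -> str:
    """Label a coverage figure against its target.

    The 0.8 cutoff for "Moderate" is an assumption chosen for illustration.
    """
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (coverage %, target %) pairs as listed in the checklist.
checklist = {
    "explicit human feedback": (100.0, 45.0),
    "quality controls": (0.0, 30.0),
    "benchmarks/datasets": (15.4, 35.0),
    "evaluation metrics": (30.8, 35.0),
    "known rater population": (15.4, 35.0),
    "known annotation unit": (46.2, 35.0),
}

bands = {name: coverage_band(cov, target)
         for name, (cov, target) in checklist.items()}
for name, band in bands.items():
    print(f"{band}: {name}")
```

Under this assumed rule the six labels match the ones shown in the checklist, but a different cutoff could reproduce them equally well.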

Strengths

Strong human-feedback signal (100% of papers).
Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
Agentic evaluation appears in 38.5% of papers.

Known Gaps

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15.4% coverage).
Benchmark coverage is thin (15.4% of papers mention benchmarks/datasets).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (Mapg-Bench vs ToolBench) before comparing methods.
Track metric sensitivity by reporting both cost and precision.
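
For the first suggested analysis, one common way to quantify judge-human agreement is Cohen's kappa, which corrects raw agreement for chance. The sketch below is an assumption about how such a check could be run, not a method taken from any paper in this hub; the labels are made-up illustrative data. Computing kappa per benchmark or per time slice would then give a rough view of drift.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(human)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = h_counts.keys() | j_counts.keys()
    expected = sum(h_counts[lab] * j_counts[lab] for lab in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical trajectory-level verdicts, for illustration only.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human_labels, judge_labels)
print(round(kappa, 3))
```

Here observed agreement is 0.75 and chance agreement is 0.5, so kappa comes out at 0.5.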

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: Mapg-Bench
Metric Slice: cost
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabe…

Highest protocol score with explicit human/eval signal plus WebArena.

Strongest benchmark reference

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vi…

Mapg-Bench gives a fast comparison anchor.

Strongest recent paper

State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning…

Useful for current practice scanning; published Apr 7, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Mar 22, 2026 · Citations: 0 · Score: 9.5
HF: Demonstrations · Eval: Human Eval, LLM as Judge · Benchmark: WebArena · Metric: Precision

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Mar 19, 2026 · Citations: 0 · Score: 6.0
HF: Demonstrations · Eval: Simulation Env · Benchmark: Mapg-Bench · Metric: Not Reported

State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation
Apr 7, 2026 · Citations: 0 · Score: 6.0
HF: Demonstrations · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Cost

Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation
Mar 10, 2026 · Citations: 0 · Score: 5.5
HF: Demonstrations · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

A Framework for Closed-Loop Robotic Assembly, Alignment and Self-Recovery of Precision Optical Systems
Mar 23, 2026 · Citations: 0 · Score: 5.5
HF: Demonstrations · Eval: Not Reported · Benchmark: Not Reported · Metric: Precision

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Mar 30, 2026 · Citations: 0 · Score: 4.5
HF: Demonstrations · Eval: Simulation Env · Benchmark: Not Reported · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling (Mar 22, 2026)	Yes: Demonstrations	Human Eval, LLM as Judge	WebArena, ToolBench	Precision, Pass@1	Not Reported
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation (Mar 19, 2026)	Yes: Demonstrations	Simulation Env	Mapg-Bench	Not Reported	Not Reported
State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation (Apr 7, 2026)	Yes: Demonstrations	Automatic Metrics	Not Reported	Not Reported	Not Reported
Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation (Mar 10, 2026)	Yes: Demonstrations	Automatic Metrics	Not Reported	Accuracy, Faithfulness	Not Reported
A Framework for Closed-Loop Robotic Assembly, Alignment and Self-Recovery of Precision Optical Systems (Mar 23, 2026)	Yes: Demonstrations	Not Reported	Not Reported	Precision	Not Reported
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning (Mar 30, 2026)	Yes: Demonstrations	Simulation Env	Not Reported	Not Reported	Not Reported
Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis (Apr 7, 2026)	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
A Survey of On-Policy Distillation for Large Language Models (Apr 1, 2026)	Yes: Expert Verification, Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads (Apr 7, 2026)	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
RoboPocket: Improve Robot Policies Instantly with Your Phone (Mar 5, 2026)	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
TimeWarp: Evaluating Web Agents by Revisiting the Past (Mar 5, 2026)	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings (Mar 11, 2026)	Yes: Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AgentHER: Hindsight Experience Replay for LLM Agent…	Meanings and Measurements: Multi-Agent Probabilisti…	State-of-the-Art Arabic Language Modeling with Spar…
Human Feedback	Demonstrations	Demonstrations	Demonstrations
Evaluation Modes	Human Eval, LLM as Judge	Simulation Env	Automatic Metrics
Benchmarks	WebArena, ToolBench	Mapg-Bench	Not reported
Metrics	Precision, Pass@1	Not reported	Not reported
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Trajectory	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Demonstrations (13)
Expert Verification (1)
Pairwise Preference (1)

Evaluation Modes

Simulation Env (3)
Automatic Metrics (2)
Human Eval (1)
LLM as Judge (1)

Top Benchmarks

Mapg-Bench (1)
ToolBench (1)
WebArena (1)

Top Metrics

Cost (2)
Precision (2)
Accuracy (1)
Faithfulness (1)

Rater Population Mix

Domain Experts (2)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 15.4% · metrics 30.8% · quality controls 0.0%.

Top Papers

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations · Human Eval · LLM as Judge · Long Horizon
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen · Mar 19, 2026 · Citations: 0
Demonstrations · Simulation Env · Multi Agent
To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart · Mar 30, 2026 · Citations: 0
Demonstrations · Simulation Env · Long Horizon
To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL.

Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation
Eeham Khan, Luis Rodriguez, Marc Queudot · Mar 10, 2026 · Citations: 0
Demonstrations · Automatic Metrics
We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and reranking under constrained token budgets.

RoboPocket: Improve Robot Policies Instantly with Your Phone
Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le · Mar 5, 2026 · Citations: 0
Demonstrations · Long Horizon
To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones.

TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0
Demonstrations · Web Browsing
The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?

State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation
Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi · Apr 7, 2026 · Citations: 0
Demonstrations · Automatic Metrics
Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by…

A Framework for Closed-Loop Robotic Assembly, Alignment and Self-Recovery of Precision Optical Systems
Seou Choi, Sachin Vaidya, Caio Silva, Shiekh Zia Uddin, Sajib Biswas Shuvo · Mar 23, 2026 · Citations: 0
Demonstrations
In this work, we present a robotics framework for the autonomous construction, alignment, and maintenance of precision optical systems.

Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
Michael Cuccarese · Apr 7, 2026 · Citations: 0
Demonstrations
This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization.

A Survey of On-Policy Distillation for Large Language Models
Mingyang Song, Mao Zheng · Apr 1, 2026 · Citations: 0
Expert Verification · Demonstrations
We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.

In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads
Charlotte Pouw, Hosein Mohebbi, Afra Alishahi, Willem Zuidema · Apr 7, 2026 · Citations: 0
Demonstrations
In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain.

Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
Yuning Wu, Ke Wang, Devin Chen, Kai Wei · Mar 11, 2026 · Citations: 0
Demonstrations
To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO).

COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Kartik Sharma, Rakshit S. Trivedi · Mar 6, 2026 · Citations: 0
Pairwise Preference · Demonstrations
Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples compared to the best baseline.
