HFEPX Hub

Automatic Metrics + Demonstrations Papers

Updated from current HFEPX corpus (Feb 27, 2026). 13 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: Retrieval. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 13 · Last published: Feb 26, 2026

Automatic Metrics Demonstrations

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 13 papers for Automatic Metrics + Demonstrations Papers. The dominant protocol signal is automatic metrics, with frequent benchmark focus on Retrieval and AuditBench and metric focus on cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by demonstration data.

Evidence: Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models; AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors; FewMMBench: A Benchmark for Multimodal Few-Shot Learning; Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling

Automatic metrics appear in 100% of papers in this hub.

Evidence: Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models; AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors; FewMMBench: A Benchmark for Multimodal Few-Shot Learning; Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: FewMMBench: A Benchmark for Multimodal Few-Shot Learning; From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences; Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models; AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models; AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors; FewMMBench: A Benchmark for Multimodal Few-Shot Learning; Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling

Where reported, raters are mostly domain experts and annotation is commonly at the trajectory level; use this to scope replication staffing.

Evidence: Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning; Oracular Programming: A Modular Foundation for Building LLM-Enabled Software; Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models; AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Stratify by benchmark (Retrieval vs AuditBench) before comparing methods.

Evidence: Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models; AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors; FewMMBench: A Benchmark for Multimodal Few-Shot Learning; Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
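
The stratification step above can be sketched roughly as follows. This is a minimal illustration, not the hub's own tooling: the paper-to-benchmark pairs are taken from the benchmark briefs on this page, while the short paper names and the helper functions `stratify` and `benchmark_matched` are hypothetical names introduced here.

```python
from collections import defaultdict

# (paper short name, benchmark) pairs taken from this hub's benchmark briefs.
# The short names are abbreviations introduced for this sketch.
BENCHMARK_MENTIONS = [
    ("FewMMBench", "Retrieval"),
    ("From Labor to Collaboration", "Retrieval"),
    ("AuditBench", "AuditBench"),
    ("FewMMBench", "FewMMBench"),
]


def stratify(pairs):
    """Group paper names by the benchmark they mention."""
    strata = defaultdict(list)
    for paper, benchmark in pairs:
        strata[benchmark].append(paper)
    return dict(strata)


def benchmark_matched(strata, paper_a, paper_b):
    """Return True if both papers share at least one benchmark stratum."""
    return any(paper_a in papers and paper_b in papers for papers in strata.values())


strata = stratify(BENCHMARK_MENTIONS)
for benchmark, papers in strata.items():
    print(f"{benchmark}: {papers}")
```

Under this sketch, two papers would only be compared head-to-head when `benchmark_matched` returns True for them; papers in different strata are reported separately.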

Benchmark Interpretation

Retrieval appears in 15.4% of hub papers (2/13); use this cohort for benchmark-matched comparisons.
AuditBench appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.
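
The coverage figures above follow from simple count-over-total arithmetic. A minimal sketch, assuming percentages are rounded to one decimal place (the helper name `coverage_pct` is hypothetical; the counts and hub size come from this page):

```python
def coverage_pct(mentions: int, total: int) -> float:
    """Share of hub papers as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * mentions / total, 1)


HUB_SIZE = 13  # papers grouped in this hub
benchmark_mentions = {"Retrieval": 2, "AuditBench": 1}

for name, count in benchmark_mentions.items():
    print(f"{name}: {coverage_pct(count, HUB_SIZE)}% ({count}/{HUB_SIZE})")
```

With cohorts this small (one or two papers), a single added or removed paper shifts coverage by about 7.7 percentage points, so the percentages are best read alongside the raw counts.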

Metric Interpretation

Cost is reported in 7.7% of hub papers (1/13); compare with a secondary metric before ranking methods.

Researcher Checklist

Maintain strength on Papers with explicit human feedback. Coverage is strong (100% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (23.1% vs 35% target).
Close gap on Papers naming evaluation metrics. Coverage is a replication risk (7.7% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (15.4% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (7.7% vs 35% target).
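
The checklist labels compare observed coverage against a target. The page does not state the exact cut-offs, so the sketch below assumes one rule that happens to reproduce the labels shown: "strong" at or above target, "usable but incomplete" at or above half the target, and "replication risk" otherwise. The `usable_ratio` threshold and the function name `coverage_status` are assumptions for illustration only.

```python
def coverage_status(coverage: float, target: float, usable_ratio: float = 0.5) -> str:
    """Label coverage relative to a target (thresholds are assumed, not documented)."""
    if coverage >= target:
        return "strong"
    if coverage >= usable_ratio * target:
        return "usable but incomplete"
    return "replication risk"


# (coverage %, target %) pairs copied from the checklist on this page.
checklist = {
    "explicit human feedback": (100.0, 45.0),
    "quality controls": (0.0, 30.0),
    "benchmarks/datasets": (23.1, 35.0),
    "evaluation metrics": (7.7, 35.0),
    "rater population": (15.4, 35.0),
    "annotation unit": (7.7, 35.0),
}

for item, (coverage, target) in checklist.items():
    print(f"{item}: {coverage_status(coverage, target)} ({coverage}% vs {target}% target)")
```

Other cut-offs would fit the same six data points equally well, so treat the half-target rule as a placeholder until the actual rule is known.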

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15.4% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: cost - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Benchmark Brief

Retrieval

Coverage: 2 papers (15.4%)

2 papers (15.4%) mention Retrieval.

Examples: FewMMBench: A Benchmark for Multimodal Few-Shot Learning; From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences

Benchmark Brief

AuditBench

Coverage: 1 paper (7.7%)

1 paper (7.7%) mentions AuditBench.

Examples: AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Benchmark Brief

FewMMBench

Coverage: 1 paper (7.7%)

1 paper (7.7%) mentions FewMMBench.

Examples: FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Metric Brief

cost

Coverage: 1 paper (7.7%)

1 paper (7.7%) mentions cost.

Examples: Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models; AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors; FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Chungpa Lee, Jy-yong Sohn, Kangwook Lee · Feb 26, 2026 · Citations: 0

Demonstrations Automatic Metrics

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations.

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman · Feb 26, 2026 · Citations: 0

Demonstrations Automatic Metrics

We introduce AuditBench, an alignment auditing benchmark.

FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0

Demonstrations Automatic Metrics

In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.

Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen · Feb 25, 2026 · Citations: 0

Demonstrations Automatic Metrics

Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.

Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
Rakshit Trivedi, Kartik Sharma, David C Parkes · Feb 24, 2026 · Citations: 0

Demonstrations Automatic Metrics Multi Agent

Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts.

From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences
Yi-Chih Huang · Feb 19, 2026 · Citations: 0

Demonstrations Automatic Metrics

Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences.

Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
Tim Fischer, Chris Biemann · Feb 17, 2026 · Citations: 0

Demonstrations Automatic Metrics

This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.

Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026 · Citations: 0

Demonstrations Automatic Metrics

Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.

AITutor-EvalKit: Exploring the Capabilities of AI Tutors
Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025 · Citations: 0

Demonstrations Automatic Metrics

We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization.

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0

Demonstrations Automatic Metrics Long Horizon

Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.

Mapping Semantic & Syntactic Relationships with Geometric Rotation
Michael Freenor, Lauren Alvarez · Oct 10, 2025 · Citations: 0

Demonstrations Automatic Metrics

Understanding how language and embedding models encode semantic relationships is fundamental to model interpretability.

Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktäschel · Jun 23, 2025 · Citations: 0

Demonstrations Automatic Metrics

Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily 'programmed' into LLMs through PBB, with important implica

Oracular Programming: A Modular Foundation for Building LLM-Enabled Software
Jonathan Laurent, André Platzer · Feb 7, 2025 · Citations: 0

Demonstrations Automatic Metrics Web Browsing

Large Language Models can solve a wide range of tasks from just a few examples, but they remain difficult to steer and lack a capability essential for building reliable software at scale: the modular composition of computations under enforc