
HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W50


Updated from the current HFEPX corpus (Mar 8, 2026). 9 papers are grouped in this weekly page. Common evaluation modes: Automatic Metrics, Simulation Env. Common annotation unit: Trajectory. Common metric signal: auc-pr. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Dec 11, 2025.

Papers: 9 · Last published: Dec 11, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality this period. Quality band: Medium.

High-Signal Coverage

100.0%

9 of 9 papers are not flagged as low-signal.

Benchmark Anchors

0.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

44.4%

Papers with reported metric mentions in extraction output.

  • 0 papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Treat this slice as an early signal only; benchmark/metric anchoring is too limited to support rigorous period-over-period claims.
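The anchor figures above are plain coverage ratios over the nine papers in this slice. A minimal sketch of how to recompute them from extraction output, assuming a per-paper record with benchmarks and metrics lists (field names are illustrative, not the actual HFEPX schema):

```python
# Recompute anchor coverage from extraction output.
# Schema is assumed for illustration; HFEPX's real field names may differ.

papers = [
    {"benchmarks": [], "metrics": ["Accuracy", "Coherence"]},
    {"benchmarks": [], "metrics": ["Cost"]},
    # ... one record per paper in the slice (9 total in 2025-W50)
]

def coverage_pct(papers: list[dict], field: str) -> float:
    """Percent of papers whose extraction output is non-empty for `field`."""
    hits = sum(1 for paper in papers if paper.get(field))
    return 100.0 * hits / len(papers)

print(f"Benchmark anchors: {coverage_pct(papers, 'benchmarks'):.1f}%")
print(f"Metric anchors:    {coverage_pct(papers, 'metrics'):.1f}%")
```

With 4 of 9 papers reporting metrics and none naming benchmarks, this reproduces the 44.4% and 0.0% figures shown above.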

Why This Time Slice Matters

  • 11.1% of papers report explicit human-feedback signals, led by pairwise preferences.
  • Automatic metrics appear in 33.3% of papers in this hub.
  • Long-horizon tasks appear in 11.1% of papers, indicating demand for agentic evaluation.

Protocol Takeaways For This Period

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Rater pools are mostly unspecified, and where the annotation unit is reported it is trajectory-level; use this to scope replication staffing.
  • Track metric sensitivity by reporting both auc-pr and cost, as in the sketch below.
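
A minimal sketch of the auc-pr-plus-cost reporting pattern, using scikit-learn's average_precision_score (the standard estimator for AUC-PR). The labels, scores, and per-example cost are illustrative placeholders, and how "cost" is defined varies across the papers in this slice:

```python
# Report AUC-PR alongside a per-example cost figure so that metric
# sensitivity can be tracked against evaluation spend.
from sklearn.metrics import average_precision_score

y_true = [0, 1, 1, 0, 1]             # gold labels (placeholder data)
y_score = [0.2, 0.8, 0.6, 0.3, 0.9]  # model scores (placeholder data)
per_example_cost_usd = 0.004         # hypothetical judge/annotation cost

auc_pr = average_precision_score(y_true, y_score)  # average precision ~ AUC-PR
total_cost = per_example_cost_usd * len(y_true)

print(f"auc-pr={auc_pr:.3f}  cost=${total_cost:.3f} over {len(y_true)} examples")
```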

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper | Date | Eval Modes | Benchmarks | Metrics | Quality Controls
KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification | Dec 9, 2025 | Automatic Metrics | Not reported | Accuracy, Coherence | Not reported
QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models | Dec 9, 2025 | Automatic Metrics | Not reported | Cost | Not reported
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning | Dec 9, 2025 | Simulation Env | Not reported | Cost | Not reported
Group Representational Position Encoding | Dec 8, 2025 | Automatic Metrics | Not reported | Cost | Not reported
Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution | Dec 11, 2025 | Not reported | Not reported | Not reported | Not reported
Interpreto: An Explainability Library for Transformers | Dec 10, 2025 | Not reported | Not reported | Not reported | Not reported
GUMBridge: a Corpus for Varieties of Bridging Anaphora | Dec 8, 2025 | Not reported | Not reported | Not reported | Not reported
What Triggers my Model? Contrastive Explanations Inform Gender Choices by Translation Models | Dec 9, 2025 | Not reported | Not reported | Not reported | Not reported
Near-Real-Time Conflict-Related Fire Detection in Sudan Using Unsupervised Deep Learning | Dec 8, 2025 | Not reported | Not reported | Not reported | Not reported
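
The "protocol completeness" ordering used for the ranking above can be approximated by counting fields that are not "Not reported" per row; the page does not publish its exact scoring, so the sketch below is an assumption:

```python
# Rank protocol-matrix rows by a simple completeness score: the number of
# fields with a reported value. Row data is transcribed from the matrix above.
FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

rows = [
    {"paper": "KD-OCT", "eval_modes": "Automatic Metrics",
     "benchmarks": "Not reported", "metrics": "Accuracy, Coherence",
     "quality_controls": "Not reported"},
    {"paper": "QSTN", "eval_modes": "Automatic Metrics",
     "benchmarks": "Not reported", "metrics": "Cost",
     "quality_controls": "Not reported"},
    # ... remaining rows from the matrix above
]

def completeness(row: dict) -> int:
    """Count fields that carry a reported value."""
    return sum(row[field] != "Not reported" for field in FIELDS)

for row in sorted(rows, key=completeness, reverse=True):
    print(f"{completeness(row)}/4  {row['paper']}")
```
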
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (11.1% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (0% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (33.3% vs 35% target); see the banding sketch after this checklist.

  • Gap: Papers with known rater population

    Coverage is a replication risk (0% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (11.1% vs 35% target).
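
The Gap / Moderate labels in this checklist are consistent with a simple coverage-to-target band. The actual cutoffs are not published; the sketch below assumes a 0.8 ratio threshold, which happens to reproduce every label above:

```python
# One plausible banding rule for the checklist labels (thresholds assumed).
def band(coverage_pct: float, target_pct: float,
         moderate_ratio: float = 0.8) -> str:
    """Label coverage relative to its target."""
    if coverage_pct >= target_pct:
        return "Met"
    if coverage_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"

print(band(33.3, 35))  # Moderate (metrics)
print(band(11.1, 45))  # Gap (human feedback)
print(band(0.0, 30))   # Gap (quality controls)
```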

Strengths

  • Despite these gaps, this hub surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • No papers in this slice report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is unspecified across the slice (0% coverage).
  • Annotation unit is under-specified (11.1% coverage).

Suggested Next Analyses

  • Track metric sensitivity by reporting both auc-pr and cost.

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (3)
  • Simulation Env (1)

Top Metrics

  • AUC-PR (1)
  • Cost (1)
  • F1 (1)
  • Precision (1)

Top Benchmarks

  • None reported in this slice.

Quality Controls

  • None reported in this slice.