HFEPX Archive Slice

HFEPX Fortnight Archive: 2025-F03

Updated from the current HFEPX corpus (Mar 8, 2026). 18 papers are grouped in this fortnight page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 9, 2025.

Papers: 18 · Last published: Feb 9, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality this period. Quality band: Medium.

High-Signal Coverage

100.0%

18 / 18 papers are not flagged as low-signal.

Benchmark Anchors

5.6%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

50.0%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.
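
The coverage figures in this triage block appear to be simple ratios over the 18 papers in the slice. The sketch below shows that arithmetic. The per-category counts (1 paper with a benchmark anchor, 9 with a metric anchor) are assumptions back-computed from the reported percentages, not values stated on the page, and `coverage_pct` is a hypothetical helper name.

```python
def coverage_pct(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


TOTAL_PAPERS = 18  # stated on the page

# Counts are assumptions inferred from the reported percentages.
coverage = {
    "high_signal": coverage_pct(18, TOTAL_PAPERS),
    "benchmark_anchors": coverage_pct(1, TOTAL_PAPERS),
    "metric_anchors": coverage_pct(9, TOTAL_PAPERS),
}
print(coverage)
```

With only 18 papers, each paper moves a coverage figure by about 5.6 percentage points, which is one reason to treat period-over-period differences in this slice cautiously.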

Why This Time Slice Matters

22.2% of papers report explicit human-feedback signals, led by demonstration data.
Automatic metrics appear in 55.6% of papers in this hub.
Web-browsing tasks appear in 11.1% of papers, indicating agentic evaluation demand.

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Raters are mostly domain experts, and annotation units are commonly mixed; use this to scope replication staffing.
Track metric sensitivity by reporting both accuracy and latency.
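
One way to act on the last takeaway is to record accuracy and latency in the same evaluation pass, so that a gain in one is never reported without the other. This is a minimal sketch, not a protocol taken from any paper in this slice: `evaluate`, `toy_predict`, and the example data are hypothetical, and exact-match accuracy is an assumed scoring rule.

```python
import statistics
import time
from typing import Callable, Dict, Sequence, Tuple


def evaluate(predict: Callable[[str], str],
             examples: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """Run predict on each (input, expected) pair and report both metrics."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = 0
    latencies_ms = []
    for text, expected in examples:
        start = time.perf_counter()
        output = predict(text)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        correct += int(output == expected)  # assumed exact-match scoring
    return {
        "accuracy": correct / len(examples),
        "latency_ms_mean": statistics.fmean(latencies_ms),
        "latency_ms_max": max(latencies_ms),
    }


def toy_predict(text: str) -> str:
    """Hypothetical stand-in for a model call."""
    return text.upper()


report = evaluate(toy_predict, [("a", "A"), ("b", "B"), ("c", "x")])
print(report)
```

Swapping `toy_predict` for two system variants and comparing the resulting reports side by side gives a simple accuracy/latency trade-off view.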

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
Feb 4, 2025 · Citations: 0 · Score: 3.9

Eval: Automatic Metrics, Simulation Env · Metrics: Win rate
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Jan 28, 2025 · Citations: 0 · Score: 3.9

Eval: Automatic Metrics · Metrics: Success rate, Task success
MoEMba: A Mamba-based Mixture of Experts for High-Density EMG-based Hand Gesture Recognition
Feb 9, 2025 · Citations: 0 · Score: 2.9

Eval: Automatic Metrics · Metrics: Accuracy
Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
Feb 8, 2025 · Citations: 0 · Score: 2.9

Eval: Automatic Metrics · Metrics: Accuracy
vCache: Verified Semantic Prompt Caching
Feb 6, 2025 · Citations: 0 · Score: 2.9

Eval: Automatic Metrics · Metrics: Error rate, Latency
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration
Feb 3, 2025 · Citations: 0 · Score: 2.9

Eval: Automatic Metrics · Metrics: Accuracy, Latency

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play	Feb 4, 2025	Automatic Metrics, Simulation Env	Not reported	Win rate	Not reported
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation	Jan 28, 2025	Automatic Metrics	Not reported	Success rate, Task success	Not reported
MoEMba: A Mamba-based Mixture of Experts for High-Density EMG-based Hand Gesture Recognition	Feb 9, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning	Feb 8, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
vCache: Verified Semantic Prompt Caching	Feb 6, 2025	Automatic Metrics	Not reported	Error rate, Latency	Not reported
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration	Feb 3, 2025	Automatic Metrics	Not reported	Accuracy, Latency	Not reported
Evaluating Spoken Language as a Biomarker for Automated Screening of Cognitive Impairment	Jan 30, 2025	Automatic Metrics	Not reported	Cost	Not reported
Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations	Jan 29, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation	Feb 4, 2025	Automatic Metrics	Not reported	Not reported	Not reported
Intrinsic Entropy of Context Length Scaling in LLMs	Feb 3, 2025	Not reported	Not reported	Context length	Not reported
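
The matrix can also be queried programmatically, for example to find papers that report a given combination of metrics. The sketch below copies the Metrics column from the table above and filters on it. Paper names are shortened for readability, and `parse_metrics` and `papers_with_all` are hypothetical helper names, not part of any HFEPX tooling.

```python
NOT_REPORTED = "Not reported"

# (shortened paper name, Metrics cell) copied from the matrix above.
ROWS = [
    ("VolleyBots", "Win rate"),
    ("CowPilot", "Success rate, Task success"),
    ("MoEMba", "Accuracy"),
    ("Unbiased Sliced Wasserstein Kernels", "Accuracy"),
    ("vCache", "Error rate, Latency"),
    ("FastKV", "Accuracy, Latency"),
    ("Evaluating Spoken Language as a Biomarker", "Cost"),
    ("Dialogue is Better Than Monologue", "Accuracy"),
    ("Dual-IPO", "Not reported"),
    ("Intrinsic Entropy of Context Length Scaling", "Context length"),
]


def parse_metrics(cell: str) -> set:
    """Split a comma-separated Metrics cell into a set of lowercase names."""
    if cell == NOT_REPORTED:
        return set()
    return {metric.strip().lower() for metric in cell.split(",")}


def papers_with_all(rows, required) -> list:
    """Return names of papers whose Metrics cell contains every required metric."""
    wanted = {metric.lower() for metric in required}
    return [name for name, cell in rows if wanted <= parse_metrics(cell)]


both = papers_with_all(ROWS, {"accuracy", "latency"})
print(both)
```

Among these ten rows, only FastKV lists accuracy and latency together, so it is the natural starting point for the accuracy/latency comparison suggested elsewhere on this page.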

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (22.2% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (0% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (22.2% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (16.7% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (0% vs 35% target).
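
The checklist labels compare observed coverage with a target. The page does not state the rule that separates "Gap" from "Moderate", so the sketch below assumes one: coverage at or above 60% of the target counts as "Moderate", anything lower as "Gap". The 0.6 ratio and the "Met" label are assumptions chosen only because they reproduce the six labels shown above; `classify` is a hypothetical name.

```python
def classify(coverage_pct: float, target_pct: float,
             moderate_ratio: float = 0.6) -> str:
    """Label coverage against a target. The 0.6 cut-off is an assumed value."""
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    if coverage_pct >= target_pct:
        return "Met"  # hypothetical label; no checklist item reaches its target
    if coverage_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (checklist item, coverage %, target %) as listed in the checklist above.
CHECKLIST = [
    ("explicit human feedback", 22.2, 45.0),
    ("quality controls", 0.0, 30.0),
    ("benchmarks/datasets", 0.0, 35.0),
    ("evaluation metrics", 22.2, 35.0),
    ("known rater population", 16.7, 35.0),
    ("known annotation unit", 0.0, 35.0),
]

labels = {name: classify(cov, target) for name, cov, target in CHECKLIST}
for name, label in labels.items():
    print(f"{label}: {name}")
```

Note that the same 22.2% coverage is labelled "Gap" against a 45% target and "Moderate" against a 35% target, so the labels depend on the target as much as on the coverage itself.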

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (16.7% coverage).
Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

Track metric sensitivity by reporting both accuracy and latency.

Recommended Queries

Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (10)
Simulation Env (1)

Top Metrics

Accuracy (1)
Latency (1)
Success rate (1)
Task success (1)

Top Benchmarks

Quality Controls

Papers In This Archive Slice

MoEMba: A Mamba-based Mixture of Experts for High-Density EMG-based Hand Gesture Recognition
Mehran Shabanpour, Kasra Rad, Sadaf Khademi, Arash Mohammadi · Feb 9, 2025 · Citations: 0

High-Density surface Electromyography (HDsEMG) has emerged as a pivotal resource for Human-Computer Interaction (HCI), offering direct insights into muscle activities and motion intentions.
Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
Manh Luong, Khai Nguyen, Dinh Phung, Gholamreza Haffari, Lizhen Qu · Feb 8, 2025 · Citations: 0

Our kernel also improves the reasoning accuracy of the MMAU-test-mini benchmarks by 4%.
Oracular Programming: A Modular Foundation for Building LLM-Enabled Software
Jonathan Laurent, André Platzer · Feb 7, 2025 · Citations: 0

Demonstrations · Web Browsing

We propose oracular programming: a foundational paradigm for integrating traditional, explicit computations with inductive oracles such as LLMs.
vCache: Verified Semantic Prompt Caching
Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu · Feb 6, 2025 · Citations: 0

We release the vCache implementation and four benchmarks to support future research.
AStar: Boosting Multimodal Reasoning with Automated Structured Thinking
Jinyang Wu, Mingkuan Feng, Guocheng Zhai, Shuai Zhang, Zheng Lian · Feb 4, 2025 · Citations: 0
Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation
Xiaomeng Yang, Mengping Yang, Jia Gong, Luozheng Qin, Zhiyu Tan · Feb 4, 2025 · Citations: 0

Pairwise Preference

However, they usually fail to produce satisfactory outputs that are aligned to users' authentic demands and preferences.
FinBloom: Knowledge Grounding Large Language Model with Real-time Financial Data
Ankur Sinha, Chaitanya Agarwal, Pekka Malo · Feb 4, 2025 · Citations: 0
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
Zelai Xu, Ruize Zhang, Chao Yu, Huining Yuan, Xiangmin Yi · Feb 4, 2025 · Citations: 0

Demonstrations · Multi Agent

We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement…
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang · Feb 3, 2025 · Citations: 0
Intrinsic Entropy of Context Length Scaling in LLMs
Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng-Neng Hwang · Feb 3, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Polynomial, trigonometric, and tropical activations
Ismail Khalfaoui-Hassani, Stefan Kesselheim · Feb 3, 2025 · Citations: 0
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration
Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim · Feb 3, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Should You Use Your Large Language Model to Explore or Exploit?
Keegan Harris, Aleksandrs Slivkins · Jan 31, 2025 · Citations: 0

Tool Use

We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff.
Evaluating Spoken Language as a Biomarker for Automated Screening of Cognitive Impairment
Maria R. Lima, Alexander Capstick, Fatemeh Geranmayeh, Ramin Nilforooshan, Maja Matarić · Jan 30, 2025 · Citations: 0

We evaluate explainable ML for screening of Alzheimer's disease and related dementias (ADRD) and severity prediction using benchmark DementiaBank speech (N = 291, 64% female, 69.8 (SD = 8.6) years).
Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations
Zijie Liu, Xinyu Zhao, Jie Peng, Zhuangdi Zhu, Qingyu Chen · Jan 29, 2025 · Citations: 0

These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information.
Safe Reinforcement Learning for Real-World Engine Control
Julian Bedei, Lucas Koch, Kevin Badalian, Alexander Winkler, Patrick Schaber · Jan 28, 2025 · Citations: 0

This work introduces a toolchain for applying Reinforcement Learning (RL), specifically the Deep Deterministic Policy Gradient (DDPG) algorithm, in safety-critical real-world environments.
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou · Jan 28, 2025 · Citations: 0

Pairwise Preference · Demonstrations · Web Browsing

We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency.
Object-Centric World Models from Few-Shot Annotations for Sample-Efficient Reinforcement Learning
Weipu Zhang, Adam Jelley, Trevor McInroe, Amos Storkey, Gang Wang · Jan 27, 2025 · Citations: 0

Empirical results demonstrate that OC-STORM significantly outperforms the STORM baseline on the Atari 100k benchmark and achieves state-of-the-art sample efficiency on challenging boss fights in the visually complex game Hollow Knight.
