
HFEPX Fortnight Archive: 2025-F10

Updated from the current HFEPX corpus (Mar 1, 2026). 10 papers are grouped in this fortnightly page. Most common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Most frequent quality control: Calibration. Common metric signal: win rate. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new evaluation experiments. The newest paper in this set is from May 18, 2025.

Papers: 10 | Last published: May 18, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality this period. Quality band: Medium.

High-Signal Coverage

100.0%

10 / 10 papers are not flagged as low-signal.

Benchmark Anchors

20.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

40.0%

Papers with reported metric mentions in extraction output.

  • 1 paper reports explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Treat this slice as an early signal only; benchmark/metric anchoring is too limited for rigorous period-over-period claims.
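
For teams automating this triage, a minimal sketch in Python: it keeps only papers that carry both benchmark and metric anchors. The dict fields (`title`, `benchmarks`, `metrics`) are hypothetical stand-ins for your own extraction output, seeded here with two rows from the Protocol Matrix below.

```python
# Hedged triage sketch: keep papers with both benchmark and metric anchors,
# since those support the most reliable period-over-period comparisons.
# The schema below is hypothetical; adapt it to your own extraction output.

papers = [
    {"title": "EVALOOOP", "benchmarks": ["MBPP+", "DROP"],
     "metrics": ["Accuracy", "Pass@1"]},
    {"title": "Mastering Multi-Drone Volleyball", "benchmarks": [],
     "metrics": ["Win rate"]},
]

def has_both_anchors(paper: dict) -> bool:
    """True when a paper names at least one benchmark and one metric."""
    return bool(paper["benchmarks"]) and bool(paper["metrics"])

anchored = [p["title"] for p in papers if has_both_anchors(p)]
print(anchored)  # ['EVALOOOP']
```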

Why This Time Slice Matters

  • 20% of papers report explicit human-feedback signals, led by demonstration data.
  • Automatic metrics appear in 30% of papers in this hub.
  • Long-horizon tasks appear in 10% of papers, indicating demand for agentic evaluation.

Protocol Takeaways For This Period

  • The most common quality-control signal is rater calibration (10% of papers).
  • Raters are mostly domain experts, and annotation units are commonly mixed; use this to scope replication staffing.
  • Add inter-annotator agreement checks when reproducing these protocols.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.
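
The ranking criterion is not published on this page, so treat the following as an illustrative sketch only: it scores a Protocol Matrix row by counting fields that are reported at all, one plausible proxy for protocol completeness.

```python
# Illustrative completeness score: one point per Protocol Matrix column
# that is actually reported ("Not reported" scores zero). This is an
# assumed proxy, not the hub's actual ranking formula.

FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

def completeness(row: dict) -> int:
    """Count how many protocol fields a paper reports."""
    return sum(1 for f in FIELDS if row.get(f, "Not reported") != "Not reported")

evaloop = {"eval_modes": "Automatic Metrics", "benchmarks": "MBPP+, DROP",
           "metrics": "Accuracy, Pass@1", "quality_controls": "Not reported"}
print(completeness(evaloop))  # 3 of 4 fields reported
```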

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice. Each entry lists eval modes, benchmarks, metrics, and quality controls.

  • EVALOOOP: A Self-Consistency-Centered Framework for Assessing Large Language Model Robustness in Programming (May 18, 2025)
    Eval modes: Automatic Metrics | Benchmarks: MBPP+, DROP | Metrics: Accuracy, Pass@1 | Quality controls: Not reported
  • Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning (May 7, 2025)
    Eval modes: Automatic Metrics | Benchmarks: Not reported | Metrics: Win rate | Quality controls: Not reported
  • BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs (May 18, 2025)
    Eval modes: Automatic Metrics | Benchmarks: Not reported | Metrics: Accuracy | Quality controls: Not reported
  • Benchmarking Retrieval-Augmented Generation for Chemistry (May 12, 2025)
    Eval modes: Not reported | Benchmarks: ChemRAG-Bench | Metrics: Not reported | Quality controls: Not reported
  • Scalable LLM Reasoning Acceleration with Low-rank Distillation (May 8, 2025)
    Eval modes: Not reported | Benchmarks: Not reported | Metrics: Latency | Quality controls: Not reported
  • Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications (May 9, 2025)
    Eval modes: Not reported | Benchmarks: Not reported | Metrics: Not reported | Quality controls: Not reported
  • ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization (May 5, 2025)
    Eval modes: Not reported | Benchmarks: Not reported | Metrics: Not reported | Quality controls: Calibration
  • EAMET: Robust Massive Model Editing via Embedding Alignment Optimization (May 17, 2025)
    Eval modes: Not reported | Benchmarks: Not reported | Metrics: Not reported | Quality controls: Not reported
  • Visual Planning: Let's Think Only with Images (May 16, 2025)
    Eval modes: Not reported | Benchmarks: Not reported | Metrics: Not reported | Quality controls: Not reported
  • CodePDE: An Inference Framework for LLM-driven PDE Solver Generation (May 13, 2025)
    Eval modes: Not reported | Benchmarks: Not reported | Metrics: Not reported | Quality controls: Not reported
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (20% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (10% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (20% vs 35% target).

  • Gap: Papers naming evaluation metrics

    Coverage is a replication risk (10% vs 35% target).

  • Moderate: Papers with known rater population

    Coverage is usable but incomplete (30% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (0% vs 35% target).
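
A small sketch of the gap logic behind this checklist, using the coverage and target percentages listed above. The 0.8 "moderate" threshold is an assumption inferred from the single 30%-vs-35% item, not a documented rule.

```python
# Reproduce the checklist's gap classification from observed coverage vs
# target. ASSUMPTION: "Moderate" means coverage is at least 80% of target;
# only the 30%-vs-35% item above supports that cutoff.

targets = {
    "explicit human feedback": (20, 45),
    "quality controls":        (10, 30),
    "benchmarks/datasets":     (20, 35),
    "evaluation metrics":      (10, 35),
    "known rater population":  (30, 35),
    "known annotation unit":   (0, 35),
}

for name, (observed, target) in targets.items():
    if observed >= target:
        status = "OK"
    elif observed >= 0.8 * target:
        status = "Moderate"
    else:
        status = "Gap"
    print(f"{status:8s} {name}: {observed}% vs {target}% target")
```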

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Annotation unit is under-specified (0% coverage).
  • Benchmark coverage is thin (20% of papers mention benchmarks/datasets).

Suggested Next Analyses

  • Add inter-annotator agreement checks when reproducing these protocols.
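
As a minimal sketch of such a check: Cohen's kappa for two raters over categorical labels. The rater sequences below are toy data, not drawn from any paper in this slice.

```python
# Cohen's kappa for two raters: chance-corrected agreement on categorical
# labels. Toy data only; assumes both raters labeled the same items and
# that expected agreement is < 1 (otherwise kappa is undefined).

from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n       # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)    # chance agreement
    return (observed - expected) / (1 - expected)

rater_1 = ["good", "bad", "good", "good", "bad"]
rater_2 = ["good", "bad", "bad", "good", "bad"]
print(round(cohens_kappa(rater_1, rater_2), 3))  # 0.615
```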

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (3)

Top Metrics

  • Win rate (1)

Top Benchmarks

  • MBPP+ (1)
  • DROP (1)
  • ChemRAG-Bench (1)

Quality Controls

  • Calibration (1)
