HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W35

Updated from current HFEPX corpus (Mar 1, 2026). 8 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 1, 2026). 8 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Frequent quality control: Calibration. Frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Aug 28, 2025.

Papers: 8 Last published: Aug 28, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

8 / 8 papers are not low-signal flagged.

Benchmark Anchors

37.5%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

50.0%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Slice Matters (Expanded)

Why This Time Slice Matters

12.5% of papers report explicit human-feedback signals, led by red-team protocols.
automatic metrics appears in 62.5% of papers in this hub.
BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Notes (Expanded)

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (12.5% of papers).
Stratify by benchmark (BrowseComp vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and f1.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Aug 28, 2025 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
Diffusion Language Models Know the Answer Before Decoding
Aug 27, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Cost
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Aug 26, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: F1
NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Aug 28, 2025 · Citations: 0 · Score: 3.5

Eval: Automatic Metrics · Metrics: Accuracy
EO-1: An Open Unified Embodied Foundation Model for General Robot Control
Aug 28, 2025 · Citations: 0 · Score: 2.0

Eval: Automatic Metrics · Metrics: Not reported
Language and Experience: A Computational Model of Social Learning in Complex Tasks
Aug 26, 2025 · Citations: 0 · Score: 2.0

Eval: Simulation Env · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP Aug 28, 2025	Automatic Metrics	DROP	Accuracy	Not reported
Diffusion Language Models Know the Answer Before Decoding Aug 27, 2025	Automatic Metrics	MMLU, GSM8K	Cost	Not reported
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning Aug 26, 2025	Automatic Metrics	Reasoning Query0retrieval	F1	Not reported
NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems Aug 28, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
EO-1: An Open Unified Embodied Foundation Model for General Robot Control Aug 28, 2025	Automatic Metrics	Not reported	Not reported	Not reported
Language and Experience: A Computational Model of Social Learning in Complex Tasks Aug 26, 2025	Simulation Env	Not reported	Not reported	Not reported
Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation Aug 25, 2025	Not reported	Not reported	Not reported	Calibration
Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM-Based Multi-Agent Systems Aug 27, 2025	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (12.5% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (12.5% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (25% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (25% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (0% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (0% vs 35% target).

Strengths

Agentic evaluation appears in 37.5% of papers.

Known Gaps

Only 12.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

Stratify by benchmark (BrowseComp vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and f1.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

Benchmark Slice: BrowseComp Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 12.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (5)
Simulation Env (1)

Top Metrics

Accuracy (1)
F1 (1)

Top Benchmarks

BrowseComp (1)
DROP (1)
Reasoning Query0retrieval (1)

Quality Controls

Calibration (1)

Papers In This Archive Slice

EO-1: An Open Unified Embodied Foundation Model for General Robot Control
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao · Aug 28, 2025 · Citations: 0

Long Horizon

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems.
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0

Red Team

These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li · Aug 28, 2025 · Citations: 0

However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored.
Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025 · Citations: 0

Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM-Based Multi-Agent Systems
Jingyu Guo, Yingying Xu · Aug 27, 2025 · Citations: 0

Multi Agent

While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases.
Language and Experience: A Computational Model of Social Learning in Complex Tasks
Cédric Colas, Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman · Aug 26, 2025 · Citations: 0

The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments.
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee · Aug 26, 2025 · Citations: 0

Long Horizon

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.
Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation
Rishikesh Devanathan, Varun Nathan, Ayush Kumar · Aug 25, 2025 · Citations: 0

In this work, we benchmark multiple generation strategies guided by structured supervision on call attributes (Intent Summaries, Topic Flows, and Quality Assurance (QA) Forms) across multiple languages.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote