HFEPX Archive Slice

HFEPX Fortnight Archive: 2025-F18

Updated from current HFEPX corpus (Mar 1, 2026). 14 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 1, 2026). 14 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Sep 4, 2025.

Papers: 14 Last published: Sep 4, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

14 / 14 papers are not low-signal flagged.

Benchmark Anchors

21.4%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

42.9%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Slice Matters (Expanded)

Why This Time Slice Matters

14.3% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 50% of papers in this hub.
BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Notes (Expanded)

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (7.1% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Aug 28, 2025 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
Diffusion Language Models Know the Answer Before Decoding
Aug 27, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Cost
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Aug 26, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: F1
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Sep 1, 2025 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Accuracy
Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions
Sep 2, 2025 · Citations: 0 · Score: 3.5

Eval: Automatic Metrics · Metrics: Accuracy
NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Aug 28, 2025 · Citations: 0 · Score: 3.5

Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP Aug 28, 2025	Automatic Metrics	DROP	Accuracy	Not reported
Diffusion Language Models Know the Answer Before Decoding Aug 27, 2025	Automatic Metrics	MMLU, GSM8K	Cost	Not reported
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning Aug 26, 2025	Automatic Metrics	Reasoning Query0retrieval	F1	Not reported
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models Sep 1, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions Sep 2, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems Aug 28, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios Sep 4, 2025	Llm As Judge	Not reported	Not reported	Not reported
BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format Sep 2, 2025	Simulation Env	Not reported	Not reported	Not reported
EO-1: An Open Unified Embodied Foundation Model for General Robot Control Aug 28, 2025	Automatic Metrics	Not reported	Not reported	Not reported
Language and Experience: A Computational Model of Social Learning in Complex Tasks Aug 26, 2025	Simulation Env	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (14.3% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (7.1% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (14.3% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (21.4% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (7.1% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (7.1% vs 35% target).

Strengths

Agentic evaluation appears in 35.7% of papers.

Known Gaps

Only 7.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.1% coverage).
Annotation unit is under-specified (7.1% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (BrowseComp vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and f1.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: BrowseComp Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 7.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.1% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (7)
Simulation Env (2)
Llm As Judge (1)

Top Metrics

Accuracy (2)
F1 (1)

Top Benchmarks

BrowseComp (1)
DROP (1)
Reasoning Query0retrieval (1)

Quality Controls

Calibration (1)

Papers In This Archive Slice

Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios
Jingen Qu, Lijun Li, Bo Zhang, Yichen Yan, Jing Shao · Sep 4, 2025 · Citations: 0

Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges.
Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
Shan Wang, Maying Shen, Nadine Chang, Chuong Nguyen, Hongdong Li · Sep 3, 2025 · Citations: 0

Experiments across multiple benchmarks demonstrate that GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang · Sep 2, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions
Seyedali Mohammadi, Bhaskara Hanuma Vedula, Hemank Lamba, Edward Raff, Ponnurangam Kumaraguru · Sep 2, 2025 · Citations: 0

To address these questions, we conduct controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped…
BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format
Roland Pihlakas, Sruthi Susan Kuriakose · Sep 2, 2025 · Citations: 0

Long Horizon

Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else.
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025 · Citations: 0

Pairwise Preference Long Horizon

We additionally contribute a CAD dataset with human preference annotations.
EO-1: An Open Unified Embodied Foundation Model for General Robot Control
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao · Aug 28, 2025 · Citations: 0

Long Horizon

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems.
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0

Red Team

These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li · Aug 28, 2025 · Citations: 0

However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored.
Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025 · Citations: 0

Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM-Based Multi-Agent Systems
Jingyu Guo, Yingying Xu · Aug 27, 2025 · Citations: 0

Multi Agent

While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases.
Language and Experience: A Computational Model of Social Learning in Complex Tasks
Cédric Colas, Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman · Aug 26, 2025 · Citations: 0

The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments.
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee · Aug 26, 2025 · Citations: 0

Long Horizon

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.
Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation
Rishikesh Devanathan, Varun Nathan, Ayush Kumar · Aug 25, 2025 · Citations: 0

In this work, we benchmark multiple generation strategies guided by structured supervision on call attributes (Intent Summaries, Topic Flows, and Quality Assurance (QA) Forms) across multiple languages.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote