HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W35

Updated from the current HFEPX corpus (Apr 17, 2026). This weekly archive page groups 21 papers. Common evaluation modes: Automatic Metrics, LLM-as-Judge. Most common rater population: Domain Experts. Frequent quality control: Calibration. Frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Aug 31, 2025.

Papers: 21 · Last published: Aug 31, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

High-Signal Coverage

100.0%

21 / 21 papers are not low-signal flagged.

Benchmark Anchors

23.8%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

57.1%

Papers with reported metric mentions in extraction output.

1 paper reports explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.


Why This Time Slice Matters

9.5% of papers report explicit human-feedback signals, led by demonstration data.
Automatic metrics appear in 61.9% of papers in this hub.
BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (4.8% of papers).
Raters are mostly domain experts, and annotation units are commonly mixed; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Aug 28, 2025 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Accuracy

AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios
Aug 27, 2025 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: Accuracy

Diffusion Language Models Know the Answer Before Decoding
Aug 27, 2025 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: Cost

Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Aug 26, 2025 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: F1

Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering
Aug 25, 2025 · Citations: 0 · Score: 5.0
Eval: LLM-as-Judge, Automatic Metrics · Metrics: Accuracy

LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
Aug 26, 2025 · Citations: 0 · Score: 4.5
Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP	Aug 28, 2025	Automatic Metrics	DROP	Accuracy	Not reported
AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios	Aug 27, 2025	Automatic Metrics	MATH	Accuracy	Not reported
Diffusion Language Models Know the Answer Before Decoding	Aug 27, 2025	Automatic Metrics	MMLU, GSM8K	Cost	Not reported
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning	Aug 26, 2025	Automatic Metrics	Reasoning Query0retrieval	F1	Not reported
Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering	Aug 25, 2025	LLM-as-Judge, Automatic Metrics	Needle In A Haystack	Accuracy	Not reported
LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination	Aug 26, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search	Aug 31, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
On the Theoretical Limitations of Embedding-Based Retrieval	Aug 28, 2025	Automatic Metrics	Not reported	Relevance	Not reported
AVIATOR: Towards AI-Agentic Vulnerability Injection Workflow for High-Fidelity, Large-Scale Code Security Dataset	Aug 28, 2025	Automatic Metrics	Not reported	Accuracy, F1	Not reported
From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs	Aug 28, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
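
The triage advice on this page (prioritize papers that have both a benchmark anchor and a metric anchor, and stratify by benchmark before comparing methods) can be sketched as a small filter over matrix rows. This is only an illustration: the dictionary layout and the helper names `has_both_anchors` and `group_by_benchmark` are assumptions, not part of HFEPX, and only four rows from the matrix are included.

```python
from collections import defaultdict

# Four rows copied from the Protocol Matrix above. The dict layout is an
# assumption for illustration; "Not reported" is modeled as an empty list.
ROWS = [
    {"paper": "Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP",
     "benchmarks": ["DROP"], "metrics": ["Accuracy"]},
    {"paper": "Diffusion Language Models Know the Answer Before Decoding",
     "benchmarks": ["MMLU", "GSM8K"], "metrics": ["Cost"]},
    {"paper": "LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination",
     "benchmarks": [], "metrics": ["Accuracy"]},
    {"paper": "L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search",
     "benchmarks": [], "metrics": ["Accuracy"]},
]


def has_both_anchors(row):
    """True when a row names at least one benchmark and at least one metric."""
    return bool(row["benchmarks"]) and bool(row["metrics"])


def group_by_benchmark(rows):
    """Map each benchmark to the fully anchored papers that mention it."""
    groups = defaultdict(list)
    for row in rows:
        if has_both_anchors(row):
            for benchmark in row["benchmarks"]:
                groups[benchmark].append(row["paper"])
    return dict(groups)


anchored = [row["paper"] for row in ROWS if has_both_anchors(row)]
by_benchmark = group_by_benchmark(ROWS)
```

Papers that appear under the same key in `by_benchmark` are the ones that can be compared directly; rows without a benchmark are left out of the grouping rather than being lumped together.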

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (9.5% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (4.8% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (19% vs 35% target).

Moderate: Papers naming evaluation metrics
Coverage is usable but incomplete (33.3% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (14.3% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (0% vs 35% target).
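
The checklist arithmetic can be reproduced with a short sketch. Two things here are assumptions rather than documented HFEPX behavior: the paper counts are back-computed from the listed percentages over 21 papers, and the rule that a field is "Moderate" when coverage reaches at least 90% of its target is a hypothetical threshold chosen only because it reproduces the labels above.

```python
TOTAL_PAPERS = 21

# (checklist item, paper count back-computed from the listed %, target %).
# Counts are inferred, not taken from HFEPX output.
CHECKS = [
    ("explicit human feedback", 2, 45.0),
    ("quality controls", 1, 30.0),
    ("benchmarks/datasets", 4, 35.0),
    ("evaluation metrics", 7, 35.0),
    ("known rater population", 3, 35.0),
    ("known annotation unit", 0, 35.0),
]


def coverage_pct(count, total=TOTAL_PAPERS):
    """Share of papers reporting a field, as a percentage with one decimal."""
    return round(100.0 * count / total, 1)


def status(pct, target, moderate_ratio=0.9):
    """Label coverage against a target; moderate_ratio is a hypothetical cutoff."""
    if pct >= target:
        return "Met"
    if pct >= moderate_ratio * target:
        return "Moderate"
    return "Gap"


report = [(name, coverage_pct(n), status(coverage_pct(n), target))
          for name, n, target in CHECKS]

for name, pct, label in report:
    print(f"{label}: {name} ({pct}% coverage)")
```

Under these assumptions, only the evaluation-metrics item lands in the "Moderate" band; every other item falls well short of its target.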

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 4.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (14.3% coverage).
Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (BrowseComp vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and calibration error.
Add inter-annotator agreement checks when reproducing these protocols.
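
Two of the suggestions above, reporting calibration error alongside accuracy and adding inter-annotator agreement checks, can be sketched in plain Python. The page does not say which calibration error or agreement statistic is meant, so this sketch assumes expected calibration error (ECE) with equal-width confidence bins and Cohen's kappa for two raters; the function names are illustrative.

```python
from collections import Counter


def accuracy(correct):
    """Fraction of predictions marked correct (1) rather than wrong (0)."""
    return sum(correct) / len(correct)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted mean |accuracy - confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Confidence 1.0 goes into the last bin instead of overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        bin_acc = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(bin_acc - mean_conf)
    return ece


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Toy inputs, made up for illustration only.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
kappa = cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
```

Reporting `accuracy` and `expected_calibration_error` together shows whether a method's accuracy gain comes with over- or under-confident predictions, and a low `cohens_kappa` on a reproduced annotation pass is a signal to revisit the labeling guidelines before trusting the labels.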

Recommended Queries

LLM-as-Judge Protocols
Benchmark Slice: BrowseComp
Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (13)
LLM-as-Judge (1)
Simulation Env (1)

Top Metrics

Accuracy (5)
Calibration error (1)
Cost (1)
F1 (1)

Top Benchmarks

BrowseComp (1)
DROP (1)
MMLU (1)
Needle In A Haystack (1)

Quality Controls

Calibration (1)

Papers In This Archive Slice

L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search
Ziqi Wang, Boqin Yuan · Aug 31, 2025 · Citations: 0

Multi Agent

We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a multi-agent retrieval framework for grounded legal question answering that decomposes queries into structured sub-problems, retrieves evidence…
When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment
Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan He · Aug 30, 2025 · Citations: 0
Estimating Parameter Fields in Multi-Physics PDEs from Scarce Measurements
Xuyang Li, Mahdi Masmoudi, Rami Gharbi, Nizar Lajnef, Vishnu Naresh Boddeti · Aug 29, 2025 · Citations: 0
On the Theoretical Limitations of Embedding-Based Retrieval
Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee · Aug 28, 2025 · Citations: 0

These new benchmarks push embeddings to work for any query and any notion of relevance that could be given.
EO-1: An Open Unified Embodied Foundation Model for General Robot Control
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao · Aug 28, 2025 · Citations: 0

Long Horizon

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems.
AVIATOR: Towards AI-Agentic Vulnerability Injection Workflow for High-Fidelity, Large-Scale Code Security Dataset
Amine Lbath, Massih-Reza Amini, Aurelien Delaitre, Vadim Okun · Aug 28, 2025 · Citations: 0

In this paper, we present AVIATOR, the first AI-agentic vulnerability injection framework.
From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs
Jessica M. Lundin, Usman Nasir Nakakana, Guillaume Chabot-Couture · Aug 28, 2025 · Citations: 0

Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable.
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0

Red Team

These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li · Aug 28, 2025 · Citations: 0

However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored.
AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios
Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani · Aug 27, 2025 · Citations: 0

In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step.
Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025 · Citations: 0

Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM-Based Multi-Agent Systems
Jingyu Guo, Yingying Xu · Aug 27, 2025 · Citations: 0

Multi Agent

While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases.
The Information Dynamics of Generative Diffusion
Dejan Stancevic, Luca Ambrogioni · Aug 27, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Language and Experience: A Computational Model of Social Learning in Complex Tasks
Cédric Colas, Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman · Aug 26, 2025 · Citations: 0

The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments.
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee · Aug 26, 2025 · Citations: 0

Long Horizon

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.
LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
Ziming Zhu, Chenglong Wang, Haosong Xv, Shunjie Xing, Yifu Huo · Aug 26, 2025 · Citations: 0

Demonstrations · Multi Agent

In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge.
VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft
Honghao Fu, Junlong Ren, Qi Chai, Deheng Ye, Yujun Cai · Aug 26, 2025 · Citations: 0
Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
Jungsuk Oh, Jay-Yoon Lee · Aug 25, 2025 · Citations: 0
Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation
Rishikesh Devanathan, Varun Nathan, Ayush Kumar · Aug 25, 2025 · Citations: 0

In this work, we benchmark multiple generation strategies guided by structured supervision on call attributes (Intent Summaries, Topic Flows, and Quality Assurance (QA) Forms) across multiple languages.
Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering
Julius Gun, Timo Oksanen · Aug 25, 2025 · Citations: 0

Our benchmark is built on a user manual for an agricultural machine, available in English, French, and German.
How Quantization Shapes Bias in Large Language Models
Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych · Aug 25, 2025 · Citations: 0

This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups.
