HFEPX Archive Slice

HFEPX Fortnight Archive: 2025-F18

Updated from current HFEPX corpus (Apr 17, 2026). 42 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 17, 2026). 42 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Sep 7, 2025.

Papers: 42 Last published: Sep 7, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

42 / 42 papers are not low-signal flagged.

Benchmark Anchors

11.9%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

40.5%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

11.9% of papers report explicit human-feedback signals, led by demonstration data.
automatic metrics appears in 42.9% of papers in this hub.
BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (2.4% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Aug 28, 2025 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios
Aug 27, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Accuracy
Diffusion Language Models Know the Answer Before Decoding
Aug 27, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Cost
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Aug 26, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: F1
Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering
Aug 25, 2025 · Citations: 0 · Score: 5.0

Eval: Llm As Judge, Automatic Metrics · Metrics: Accuracy
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Sep 1, 2025 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP Aug 28, 2025	Automatic Metrics	DROP	Accuracy	Not reported
AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios Aug 27, 2025	Automatic Metrics	MATH	Accuracy	Not reported
Diffusion Language Models Know the Answer Before Decoding Aug 27, 2025	Automatic Metrics	MMLU, GSM8K	Cost	Not reported
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning Aug 26, 2025	Automatic Metrics	Reasoning Query0retrieval	F1	Not reported
Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering Aug 25, 2025	Llm As Judge, Automatic Metrics	Needle In A Haystack	Accuracy	Not reported
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models Sep 1, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination Aug 26, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR Sep 6, 2025	Automatic Metrics	Not reported	Precision, Recall	Not reported
No Text Needed: Forecasting MT Quality and Inequity from Fertility and Metadata Sep 5, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions Sep 2, 2025	Automatic Metrics	Not reported	Accuracy	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (11.9% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (2.4% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (9.5% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (26.2% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (11.9% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (2.4% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 2.4% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (11.9% coverage).
Annotation unit is under-specified (2.4% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (BrowseComp vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: BrowseComp Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 2.4% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (11.9% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (18)
Llm As Judge (2)
Simulation Env (2)

Top Metrics

Accuracy (6)
Cost (2)
F1 (2)
Calibration error (1)

Top Benchmarks

BrowseComp (1)
DROP (1)
MMLU (1)
Needle In A Haystack (1)

Quality Controls

Calibration (1)

Papers In This Archive Slice

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen · Sep 7, 2025 · Citations: 0
TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition
Tran Nguyen Anh, Truong Dinh Dung, Vo Van Nam, Minh N. H. Nguyen · Sep 7, 2025 · Citations: 0
QCSE: A Pretrained Quantum Context-Sensitive Word Embedding for Natural Language Processing
Charles M. Varmantchaonala, Niclas Götting, Nils-Erik Schütte, Jean Louis E. K. Fendji, Christopher Gies · Sep 6, 2025 · Citations: 0

To evaluate the effectiveness of the model and the associated context matrix methods, evaluations are conducted on both a Fulani corpus, a low-resource African language, dataset of small size and an English corpus of slightly larger size.
New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR
Xugang Lu, Peng Shen, Hisashi Kawai · Sep 6, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
BinaryShield: Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints
Waris Gill, Natalie Isak, Matthew Dressman · Sep 6, 2025 · Citations: 0
No Text Needed: Forecasting MT Quality and Inequity from Fertility and Metadata
Jessica M. Lundin, Ada Zhang, David Adelani, Cody Carroll · Sep 5, 2025 · Citations: 0

Using only a handful of features, token fertility ratios, token counts, and basic linguistic metadata (language family, script, and region), we can forecast ChrF scores for GPT-4o translations across 203 languages in the FLORES-200…
Post-training Large Language Models for Diverse High-Quality Responses
Yilei Chen, Souradip Chakraborty, Lorenz Wolf, Yannis Paschalidis, Aldo Pacchiano · Sep 5, 2025 · Citations: 0
Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios
Jingen Qu, Lijun Li, Bo Zhang, Yichen Yan, Jing Shao · Sep 4, 2025 · Citations: 0

Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges.
From Editor to Dense Geometry Estimator
JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie · Sep 4, 2025 · Citations: 0
MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages
Dan Saattrup Smart · Sep 4, 2025 · Citations: 0
CausalARC: Abstract Reasoning with Causal World Models
Jacqueline Maasch, John Kalantari, Kia Khezeli · Sep 3, 2025 · Citations: 0

Demonstrations

As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4)…
Do Language Models Follow Occam's Razor? An Evaluation of Parsimony in Inductive and Abductive Reasoning
Yunxin Sun, Abulhair Saparov · Sep 3, 2025 · Citations: 0

The task for the intelligent agent is to produce hypotheses to explain observations under a given world model.
Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
Shan Wang, Maying Shen, Nadine Chang, Chuong Nguyen, Hongdong Li · Sep 3, 2025 · Citations: 0

Experiments across multiple benchmarks demonstrate that GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang · Sep 2, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions
Seyedali Mohammadi, Bhaskara Hanuma Vedula, Hemank Lamba, Edward Raff, Ponnurangam Kumaraguru · Sep 2, 2025 · Citations: 0

To address these questions, we conduct controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped…
From Noisy Labels to Intrinsic Structure: A Geometric-Structural Dual-Guided Framework for Noise-Robust Medical Image Segmentation
Tao Wang, Zhenxuan Zhang, Yuanbo Zhou, Xinlin Zhang, Yuanbin Chen · Sep 2, 2025 · Citations: 0
BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format
Roland Pihlakas, Sruthi Susan Kuriakose · Sep 2, 2025 · Citations: 0

Long Horizon

Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else.
CMRAG: Co-modality-based visual document retrieval and question answering
Wang Chen, Wenhan Yu, Guanqiang Qi, Weikang Li, Yang Li · Sep 2, 2025 · Citations: 0

Experiments demonstrate that our proposed framework consistently outperforms single-modality--based RAG in multiple visual document question-answering (VDQA) benchmarks.
End-to-End Low-Level Neural Control of an Industrial-Grade 6D Magnetic Levitation System
Philipp Hartmann, Jannick Stranghöner, Klaus Neumann · Sep 1, 2025 · Citations: 0

Demonstrations

Magnetic levitation is poised to revolutionize industrial automation by integrating flexible in-machine product transport and seamless manipulation.
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025 · Citations: 0

Pairwise Preference Long Horizon

We additionally contribute a CAD dataset with human preference annotations.
TempCore: Are Video QA Benchmarks Temporally Grounded? A Frame Selection Sensitivity Analysis and Benchmark
Hyunjong Ok, Jaeho Lee · Sep 1, 2025 · Citations: 0

But do current Video QA benchmarks genuinely require temporal frame selection, or can most questions be answered regardless of which frames are shown?
L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search
Ziqi Wang, Boqin Yuan · Aug 31, 2025 · Citations: 0

Multi Agent

We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a multi-agent retrieval framework for grounded legal question answering that decomposes queries into structured sub-problems, retrieves evidence…
When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment
Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan He · Aug 30, 2025 · Citations: 0
Estimating Parameter Fields in Multi-Physics PDEs from Scarce Measurements
Xuyang Li, Mahdi Masmoudi, Rami Gharbi, Nizar Lajnef, Vishnu Naresh Boddeti · Aug 29, 2025 · Citations: 0
On the Theoretical Limitations of Embedding-Based Retrieval
Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee · Aug 28, 2025 · Citations: 0

These new benchmarks push embeddings to work for any query and any notion of relevance that could be given.
EO-1: An Open Unified Embodied Foundation Model for General Robot Control
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao · Aug 28, 2025 · Citations: 0

Long Horizon

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems.
AVIATOR: Towards AI-Agentic Vulnerability Injection Workflow for High-Fidelity, Large-Scale Code Security Dataset
Amine Lbath, Massih-Reza Amini, Aurelien Delaitre, Vadim Okun · Aug 28, 2025 · Citations: 0

In this paper, we present AVIATOR, the first AI-agentic vulnerability injection framework.
From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs
Jessica M. Lundin, Usman Nasir Nakakana, Guillaume Chabot-Couture · Aug 28, 2025 · Citations: 0

Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable.
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0

Red Team

These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li · Aug 28, 2025 · Citations: 0

However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored.
AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios
Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani · Aug 27, 2025 · Citations: 0

In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step.
Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025 · Citations: 0

Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM-Based Multi-Agent Systems
Jingyu Guo, Yingying Xu · Aug 27, 2025 · Citations: 0

Multi Agent

While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases.
The Information Dynamics of Generative Diffusion
Dejan Stancevic, Luca Ambrogioni · Aug 27, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Language and Experience: A Computational Model of Social Learning in Complex Tasks
Cédric Colas, Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman · Aug 26, 2025 · Citations: 0

The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments.
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee · Aug 26, 2025 · Citations: 0

Long Horizon

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.
LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
Ziming Zhu, Chenglong Wang, Haosong Xv, Shunjie Xing, Yifu Huo · Aug 26, 2025 · Citations: 0

Demonstrations Multi Agent

In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge.
VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft
Honghao Fu, Junlong Ren, Qi Chai, Deheng Ye, Yujun Cai · Aug 26, 2025 · Citations: 0
Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
Jungsuk Oh, Jay-Yoon Lee · Aug 25, 2025 · Citations: 0
Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation
Rishikesh Devanathan, Varun Nathan, Ayush Kumar · Aug 25, 2025 · Citations: 0

In this work, we benchmark multiple generation strategies guided by structured supervision on call attributes (Intent Summaries, Topic Flows, and Quality Assurance (QA) Forms) across multiple languages.
Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering
Julius Gun, Timo Oksanen · Aug 25, 2025 · Citations: 0

Our benchmark is built on a user manual for an agricultural machine, available in English, French, and German.
How Quantization Shapes Bias in Large Language Models
Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych · Aug 25, 2025 · Citations: 0

This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups.

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now