
HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W36

Updated from the current HFEPX corpus (Mar 8, 2026). 12 papers are grouped in this weekly page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Common metric signal: Accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Sep 7, 2025.

Papers: 12 · Last published: Sep 7, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium.

High-Signal Coverage

100.0%

12 / 12 papers are not flagged as low-signal.

Benchmark Anchors

0.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

33.3%

Papers with reported metric mentions in extraction output.

  • 0 papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Treat this slice as an early signal only; benchmark/metric anchoring is too limited to support rigorous period-over-period claims.
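
To act on the anchor rule above programmatically, here is a minimal triage sketch, assuming a simple per-paper record (the PaperRecord fields are illustrative, not the HFEPX extraction schema): keep only papers whose extraction output names at least one benchmark and at least one metric.

```python
# Hypothetical record shape for triage; field names are assumptions, not the HFEPX schema.
from dataclasses import dataclass, field


@dataclass
class PaperRecord:
    title: str
    benchmarks: list[str] = field(default_factory=list)  # e.g. ["LibriSpeech"]
    metrics: list[str] = field(default_factory=list)     # e.g. ["Accuracy"]


def anchored(papers: list[PaperRecord]) -> list[PaperRecord]:
    """Keep papers with both benchmark and metric anchors."""
    return [p for p in papers if p.benchmarks and p.metrics]


# Two illustrative rows modeled on this slice's protocol matrix.
slice_papers = [
    PaperRecord("Error Notebook-Guided, Training-Free Part Retrieval ...",
                benchmarks=[], metrics=["Accuracy"]),
    PaperRecord("Self-adaptive Dataset Construction ...",
                benchmarks=[], metrics=[]),
]
print(len(anchored(slice_papers)))  # 0 -- no paper in this slice carries a benchmark anchor
```

For this slice the filter returns nothing, which is exactly why the slice should be read as an early signal rather than as a basis for period-over-period claims.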

Why This Time Slice Matters

  • 8.3% of papers report explicit human-feedback signals, led by pairwise preferences.
  • Automatic metrics appear in 33.3% of papers in this hub.
  • Long-horizon tasks appear in 16.7% of papers, indicating agentic evaluation demand (the share arithmetic is sketched after this list).
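
The shares above are plain ratios over the 12 papers in this slice. A worked sketch follows; the count of 4 matches the Automatic Metrics tally reported later on this page, while the counts of 1 and 2 are back-derived from the percentages and should be treated as assumptions.

```python
# Coverage shares as simple ratios over the 12 papers in this slice.
total = 12
counts = {
    "explicit human-feedback signals": 1,  # assumed from 8.3%
    "automatic metrics": 4,                # matches Automatic Metrics (4) below
    "long-horizon tasks": 2,               # assumed from 16.7%
}
for name, count in counts.items():
    print(f"{name}: {count}/{total} = {count / total:.1%}")
# explicit human-feedback signals: 1/12 = 8.3%
# automatic metrics: 4/12 = 33.3%
# long-horizon tasks: 2/12 = 16.7%
```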

Protocol Takeaways For This Period

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Rater context is mostly domain experts, and annotation is typically done at the trajectory level; use this to scope replication staffing.
  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration (a minimal agreement check is sketched after this list).
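
A minimal sketch of that calibration check, assuming paired human and LLM-judge labels on the same items (the label values and helper name are illustrative): report raw agreement alongside Cohen's kappa so that chance agreement does not inflate the picture.

```python
from collections import Counter


def agreement_and_kappa(human: list[str], judge: list[str]) -> tuple[float, float]:
    """Raw agreement and Cohen's kappa between two label sequences of equal length."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if the two raters labeled independently with their own marginals.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum((h_counts[c] / n) * (j_counts[c] / n)
                   for c in set(h_counts) | set(j_counts))
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa


# Hypothetical verdicts on six shared items.
human_labels = ["good", "bad", "good", "good", "bad", "good"]
judge_labels = ["good", "bad", "bad", "good", "bad", "good"]
obs, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"agreement={obs:.2f}, kappa={kappa:.2f}")  # agreement=0.83, kappa=0.67
```

A kappa that sits well below raw agreement is the signal to look for: it means much of the apparent consistency could be explained by label imbalance rather than genuine judge calibration.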

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper | Date | Eval Modes | Benchmarks | Metrics | Quality Controls
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models | Sep 1, 2025 | Automatic Metrics | Not reported | Accuracy | Not reported
New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR | Sep 6, 2025 | Automatic Metrics | Not reported | Precision, Recall | Not reported
No Text Needed: Forecasting MT Quality and Inequity from Fertility and Metadata | Sep 5, 2025 | Automatic Metrics | Not reported | Accuracy | Not reported
Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions | Sep 2, 2025 | Automatic Metrics | Not reported | Accuracy | Not reported
Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios | Sep 4, 2025 | Llm As Judge | Not reported | Not reported | Not reported
BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format | Sep 2, 2025 | Simulation Env | Not reported | Not reported | Not reported
Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection | Sep 3, 2025 | Not reported | Not reported | Not reported | Not reported
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR | Sep 2, 2025 | Not reported | Not reported | Not reported | Not reported
TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition | Sep 7, 2025 | Not reported | Not reported | Not reported | Not reported
BinaryShield: Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints | Sep 6, 2025 | Not reported | Not reported | Not reported | Not reported
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (8.3% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (0% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (25% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (8.3% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (8.3% vs 35% target). A sketch of the coverage-versus-target check follows this list.
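
A sketch of that coverage-versus-target check: the coverage and target values come from the checklist above, while the banding rule (coverage within roughly ten points of target counts as Moderate) is an assumption that happens to reproduce the labels shown here.

```python
# Checklist values copied from above; the 10-point banding threshold is an assumption.
checks = [
    ("explicit human feedback", 0.083, 0.45),
    ("quality controls",        0.00,  0.30),
    ("benchmarks/datasets",     0.00,  0.35),
    ("evaluation metrics",      0.25,  0.35),
    ("known rater population",  0.083, 0.35),
    ("known annotation unit",   0.083, 0.35),
]
for name, coverage, target in checks:
    band = "Moderate" if coverage >= target - 0.10 else "Gap"
    print(f"{band}: {name} ({coverage:.1%} vs {target:.0%} target)")
```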

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • No papers in this slice report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (8.3% coverage).
  • Annotation unit is under-specified (8.3% coverage).

Suggested Next Analyses

  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Track metric sensitivity by reporting both accuracy and error rate (a quick illustration follows this list).
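
A small illustration of that suggestion with hypothetical accuracy values: error rate is simply 1 - accuracy, and reporting both makes small shifts easier to see when accuracy is high.

```python
# Hypothetical period-over-period accuracy values; error rate = 1 - accuracy.
runs = {"period A": 0.962, "period B": 0.948}
for name, acc in runs.items():
    print(f"{name}: accuracy={acc:.1%}, error rate={1 - acc:.1%}")
# A 1.4-point accuracy drop reads as a roughly 37% relative increase in error rate (3.8% -> 5.2%).
```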

Known Limitations
  • No papers in this slice report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (8.3% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (4)
  • Llm As Judge (1)
  • Simulation Env (1)

Top Metrics

  • Accuracy (1)
  • Error rate (1)
  • F1 (1)
  • Jailbreak success rate (1)

Top Benchmarks

  • None reported in this slice.

Quality Controls

  • None reported in this slice.
