
HFEPX Daily Archive: 2025-10-08


Updated from the current HFEPX corpus (Apr 12, 2026). This daily page groups 12 papers. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Most frequent quality control: Adjudication. Recurring benchmark: AlpacaEval. Most common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new evaluation experiments. The newest paper in this set is from Oct 8, 2025.

Papers: 12 · Last published: Oct 8, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium.

High-Signal Coverage

100.0%

12 of 12 papers are free of low-signal flags.

Benchmark Anchors

16.7%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

33.3%

Papers with reported metric mentions in extraction output.

  • 1 paper reports explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as an early signal only; benchmark/metric anchoring is too limited for rigorous period-over-period claims.
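The prioritization rule above is easy to script. A minimal sketch, assuming each extraction record is a dict with `benchmarks` and `metrics` lists (illustrative field names, not the actual HFEPX schema):

```python
# Keep only papers carrying both a benchmark anchor and a metric anchor,
# since only those support reliable period-over-period comparisons.
# Field names below are illustrative, not the actual HFEPX schema.
papers = [
    {"title": "EconCausal", "benchmarks": [], "metrics": ["Accuracy", "Cost"]},
    {"title": "PIKA", "benchmarks": ["LMSYS Chatbot Arena", "AlpacaEval"], "metrics": []},
]

def has_both_anchors(paper: dict) -> bool:
    """True when the paper names at least one benchmark and one metric."""
    return bool(paper["benchmarks"]) and bool(paper["metrics"])

anchored = [p["title"] for p in papers if has_both_anchors(p)]
print(anchored)  # [] -- in this slice, no paper reports both anchors
```

On this slice the filter returns an empty list, which is exactly why the primary action above recommends treating the period as early signal only.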


Why This Time Slice Matters

  • 8.3% of papers report explicit human-feedback signals, led by pairwise preferences.
  • Automatic metrics appear in 25% of papers in this hub.
  • AlpacaEval is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

  • The most common quality-control signal is adjudication (8.3% of papers).
  • Raters are mostly domain experts, and annotation commonly uses mixed annotation units; use this to scope replication staffing.
  • Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science (Oct 8, 2025)
  Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Accuracy, Cost · Quality controls: Adjudication

Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation (Oct 8, 2025)
  Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Accuracy, Error rate · Quality controls: Not reported

Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Recovery (Oct 8, 2025)
  Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: MSE · Quality controls: Not reported

FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline (Oct 8, 2025)
  Eval modes: Human Eval · Benchmarks: Furina Bench · Metrics: Not reported · Quality controls: Not reported

PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch (Oct 8, 2025)
  Eval modes: Not reported · Benchmarks: LMSYS Chatbot Arena, AlpacaEval · Metrics: Not reported · Quality controls: Not reported

PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing (Oct 8, 2025)
  Eval modes: Not reported · Benchmarks: Not reported · Metrics: Recall · Quality controls: Not reported

Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible (Oct 8, 2025)
  Eval modes: Not reported · Benchmarks: Not reported · Metrics: Not reported · Quality controls: Not reported

Search-R3: Unifying Reasoning and Embedding in Large Language Models (Oct 8, 2025)
  Eval modes: Not reported · Benchmarks: Not reported · Metrics: Not reported · Quality controls: Not reported

LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding (Oct 8, 2025)
  Eval modes: Not reported · Benchmarks: Not reported · Metrics: Not reported · Quality controls: Not reported

Exposing Citation Vulnerabilities in Generative Engines (Oct 8, 2025)
  Eval modes: Not reported · Benchmarks: Not reported · Metrics: Not reported · Quality controls: Not reported
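The "ranked by protocol completeness" ordering can be approximated directly from the matrix. A minimal sketch, assuming each row is a record with the four matrix fields (field names are illustrative, not the HFEPX export schema):

```python
# Score each matrix row by how many of its four protocol fields are
# actually reported; higher scores surface higher-signal papers first.
FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

rows = [
    {"title": "EconCausal", "eval_modes": "Automatic Metrics",
     "benchmarks": "Not reported", "metrics": "Accuracy, Cost",
     "quality_controls": "Adjudication"},
    {"title": "Search-R3", "eval_modes": "Not reported",
     "benchmarks": "Not reported", "metrics": "Not reported",
     "quality_controls": "Not reported"},
]

def completeness(row: dict) -> int:
    """Count the protocol fields that carry a real value."""
    return sum(row[field] != "Not reported" for field in FIELDS)

for row in sorted(rows, key=completeness, reverse=True):
    print(f"{completeness(row)}/4  {row['title']}")
# 3/4  EconCausal
# 0/4  Search-R3
```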
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (8.3% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (8.3% vs 30% target).

  • Moderate: Papers naming benchmarks/datasets

    Coverage is usable but incomplete (25% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (25% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (8.3% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (0% vs 35% target). The Gap/Moderate banding behind these labels is sketched after this checklist.
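The Gap/Moderate labels in this checklist are consistent with a simple coverage-vs-target banding. A minimal sketch of one plausible rule (Gap below half the target, Moderate below the target; the actual HFEPX thresholds are not published on this page):

```python
def band(coverage: float, target: float) -> str:
    """Classify a coverage percentage against its target percentage."""
    if coverage < target / 2:
        return "Gap"        # replication risk
    if coverage < target:
        return "Moderate"   # usable but incomplete
    return "OK"

# Percentages and targets taken from the checklist above.
checks = [
    ("Explicit human feedback", 8.3, 45.0),
    ("Quality controls reported", 8.3, 30.0),
    ("Benchmarks/datasets named", 25.0, 35.0),
    ("Evaluation metrics named", 25.0, 35.0),
    ("Rater population known", 8.3, 35.0),
    ("Annotation unit known", 0.0, 35.0),
]
for name, cov, tgt in checks:
    print(f"{band(cov, tgt):8s} {name}: {cov}% vs {tgt}% target")
```

This rule reproduces every label above, but it is an inference from the numbers shown, not a documented threshold.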

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • Only 8.3% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (8.3% coverage).
  • Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

  • Stratify by benchmark (AlpacaEval vs AlpacaEval 2.0) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and cost.
  • Add inter-annotator agreement checks when reproducing these protocols; a minimal agreement sketch follows this list.
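For the agreement checks suggested above, a minimal sketch of Cohen's kappa for two raters over categorical labels (pure Python; `sklearn.metrics.cohen_kappa_score` gives the same result if scikit-learn is available). The rater labels are hypothetical:

```python
from collections import Counter

def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters labeling the same items
    (assumes chance agreement below 1)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical raters judging six responses as "good"/"bad".
a = ["good", "good", "bad", "good", "bad", "bad"]
b = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohen_kappa(a, b), 3))  # 0.333
```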

Known Limitations
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (3)
  • Human Eval (1)

Top Metrics

  • Accuracy (2)
  • Cost (1)
  • Latency (1)
  • Recall (1)

Top Benchmarks

  • AlpacaEval (1)
  • AlpacaEval 2.0 (1)
  • Arena Hard (1)
  • DocVQA (1)

Quality Controls

  • Adjudication (1)
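These snapshot counts are plain tallies over the extraction records. A minimal sketch, assuming each record exposes lists of modes and metrics (illustrative field names, not the actual HFEPX schema):

```python
from collections import Counter

# Tally evaluation modes and metrics across extraction records.
# Field names are illustrative, not the actual HFEPX schema.
records = [
    {"eval_modes": ["Automatic Metrics"], "metrics": ["Accuracy", "Cost"]},
    {"eval_modes": ["Automatic Metrics"], "metrics": ["Accuracy"]},
    {"eval_modes": ["Human Eval"], "metrics": []},
]

modes = Counter(m for r in records for m in r["eval_modes"])
metrics = Counter(m for r in records for m in r["metrics"])
print(modes.most_common())    # [('Automatic Metrics', 2), ('Human Eval', 1)]
print(metrics.most_common())  # [('Accuracy', 2), ('Cost', 1)]
```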
