

HFEPX Daily Archive: 2025-09-30


Updated from the current HFEPX corpus (Mar 10, 2026). 8 papers are grouped in this daily page. Common evaluation modes: automatic metrics, LLM-as-judge. Most common rater population: domain experts. Common annotation unit: multi-dimensional rubric. Frequent quality control: inter-annotator agreement reported. Frequently cited benchmark: AURORA-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Sep 30, 2025.

Papers: 8 · Last published: Sep 30, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium.

High-Signal Coverage

100.0%

8 of 8 papers are not flagged as low-signal.

Benchmark Anchors

25.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

25.0%

Papers with reported metric mentions in extraction output.

  • 1 paper reports explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons (a filtering sketch follows below).

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.
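
A minimal filtering sketch for that triage step, assuming a simple per-paper record shape (the field names are illustrative, not the HFEPX extraction schema):

```python
# Keep only papers that carry both benchmark and metric anchors.
# Record shape is assumed for illustration.
papers = [
    {"title": "MENLO", "benchmarks": [], "metrics": ["agreement"]},
    {"title": "PrefDisco", "benchmarks": [], "metrics": ["accuracy"]},
    {"title": "EditReward", "benchmarks": ["GenAI-Bench", "AURORA-Bench"], "metrics": []},
]

anchored = [p for p in papers if p["benchmarks"] and p["metrics"]]
print([p["title"] for p in anchored])
# [] -- in this slice no paper has both anchors, hence "early signal only"
```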


Why This Time Slice Matters

  • 50% of papers report explicit human-feedback signals, led by pairwise preferences.
  • Automatic metrics appear in 25% of papers in this hub.
  • AURORA-Bench is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

  • The most common quality-control signal is inter-annotator agreement reporting (12.5% of papers); a kappa sketch follows this list.
  • Raters are mostly domain experts, and annotation commonly uses multi-dimensional rubrics; use this to scope replication staffing.
  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
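
As a concrete reference for the agreement signal above, inter-annotator agreement is commonly summarized with Cohen's kappa; a minimal scikit-learn sketch with made-up ratings:

```python
# Cohen's kappa between two raters: chance-corrected agreement.
# The ratings below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

rater_a = ["good", "bad", "good", "good", "bad", "good"]
rater_b = ["good", "bad", "bad", "good", "bad", "good"]

print(f"kappa = {cohen_kappa_score(rater_a, rater_b):.2f}")  # kappa = 0.67
```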

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.
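
A minimal sketch of one way such a completeness ranking could be computed, counting reported protocol fields per paper (the field names and scoring are assumptions, not this page's actual formula):

```python
# Hypothetical completeness score: count protocol-matrix fields that are
# reported (i.e., not "Not reported"). Assumed record shape, not the
# actual HFEPX ranking logic.
FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

def completeness(paper: dict) -> int:
    """Number of protocol fields with a reported value."""
    return sum(paper.get(f, "Not reported") != "Not reported" for f in FIELDS)

papers = [
    {"title": "MENLO", "eval_modes": "Automatic metrics", "benchmarks": "Not reported",
     "metrics": "Agreement", "quality_controls": "Inter-annotator agreement reported"},
    {"title": "LD-MoLE"},  # nothing reported
]
ranked = sorted(papers, key=completeness, reverse=True)
print([(p["title"], completeness(p)) for p in ranked])
# [('MENLO', 3), ('LD-MoLE', 0)]
```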

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

| Paper | Date | Eval Modes | Benchmarks | Metrics | Quality Controls |
| --- | --- | --- | --- | --- | --- |
| MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages | Sep 30, 2025 | Automatic metrics | Not reported | Agreement | Inter-annotator agreement reported |
| PrefDisco: Benchmarking Proactive Personalized Reasoning | Sep 30, 2025 | Automatic metrics | Not reported | Accuracy | Not reported |
| EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing | Sep 30, 2025 | LLM-as-judge | GenAI-Bench, AURORA-Bench | Not reported | Not reported |
| BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses | Sep 30, 2025 | Not reported | BiasFreeBench | Not reported | Not reported |
| ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation | Sep 30, 2025 | Not reported | Not reported | Not reported | Not reported |
| Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents | Sep 30, 2025 | Not reported | Not reported | Not reported | Not reported |
| Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts | Sep 30, 2025 | Not reported | Not reported | Not reported | Not reported |
| LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts | Sep 30, 2025 | Not reported | Not reported | Not reported | Not reported |

Researcher Workflow (Detailed)

Checklist

  • Strong: Papers with explicit human feedback

    Coverage is strong (50% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (12.5% vs 30% target).

  • Moderate: Papers naming benchmarks/datasets

    Coverage is usable but incomplete (25% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (25% vs 35% target).

  • Strong: Papers with known rater population

    Coverage is strong (50% vs 35% target).

  • Moderate: Papers with known annotation unit

    Coverage is usable but incomplete (25% vs 35% target).
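
A sketch of how the Strong/Moderate/Gap bands above could be derived; the 50%-of-target cutoff for Moderate is inferred from the reported numbers, not a documented rule:

```python
# Banding rule inferred from the checklist: Strong at or above target,
# Moderate at half the target or above, Gap below that. An assumption,
# not this page's documented logic.
def band(coverage: float, target: float) -> str:
    if coverage >= target:
        return "Strong"
    if coverage >= 0.5 * target:
        return "Moderate"
    return "Gap"

print(band(50.0, 45.0))  # Strong
print(band(25.0, 35.0))  # Moderate
print(band(12.5, 30.0))  # Gap
```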

Strengths

  • Strong human-feedback signal (50% of papers).

Known Gaps

  • Only 12.5% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Benchmark coverage is thin (25% of papers mention benchmarks/datasets).

Suggested Next Analyses

  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Stratify by benchmark (AURORA-Bench vs EditReward-Bench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and agreement.

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic metrics (2)
  • LLM-as-judge (1)

Top Metrics

  • Accuracy (1)
  • Agreement (1)

Top Benchmarks

  • AURORA-Bench (1)
  • EditReward-Bench (1)
  • GenAI-Bench (1)

Quality Controls

  • Inter-annotator agreement reported (1)
