
Fortnight Archive

HFEPX Fortnight Archive: 2025-F19

Updated from the current HFEPX corpus (Feb 27, 2026). This page groups 10 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequent quality control: Calibration. Frequently cited benchmark: AdvBench. Common metric signal: accuracy. Use this page to compare protocol setups, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Sep 20, 2025.

Papers: 10 · Last published: Sep 20, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

This page tracks 10 papers for HFEPX Fortnight Archive 2025-F19. Dominant protocol signals are automatic metrics and simulation environments, with benchmark focus on AdvBench and AIME and metric focus on accuracy and AUROC. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • AdvBench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
  • AIME appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 30% of hub papers (3/10); compare with a secondary metric before ranking methods.
  • auroc is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.
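The limitations section notes that this synthesis is grounded in metadata and abstracts only, so mention percentages like those above amount to keyword coverage over the cohort. A minimal sketch of how such numbers could be computed; the abstracts and the whole-word, case-insensitive matching rule are illustrative assumptions, not the hub's actual pipeline:

```python
import re

# Illustrative abstracts (invented); the real hub parses its own metadata.
abstracts = {
    "p1": "We report accuracy and AUROC on a held-out split.",
    "p2": "Our method improves accuracy on AIME problems.",
    "p3": "A qualitative study of creative workflows.",
}

def metric_coverage(term: str) -> tuple[int, float]:
    """Count papers whose abstract mentions `term` (whole word, any case)."""
    hits = sum(
        bool(re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE))
        for text in abstracts.values()
    )
    return hits, hits / len(abstracts)

hits, frac = metric_coverage("accuracy")
print(f"accuracy: {hits}/{len(abstracts)} papers ({frac:.0%})")
```

On this toy corpus, "accuracy" is found in 2 of 3 abstracts; the same scan per metric name yields the percentages reported in these briefs.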

Researcher Checklist

  • Close the gap on papers with explicit human feedback: coverage is a replication risk (20% vs 45% target).
  • Close the gap on papers reporting quality controls: coverage is a replication risk (10% vs 30% target).
  • Tighten coverage of papers naming benchmarks/datasets: usable but incomplete (30% vs 35% target).
  • Maintain strength on papers naming evaluation metrics: coverage is strong (50% vs 35% target).
  • Tighten coverage of papers with a known rater population: usable but incomplete (30% vs 35% target).
  • Close the gap on papers with a known annotation unit: coverage is a replication risk (0% vs 35% target).
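The checklist labels above follow a simple observed-vs-target comparison. A small sketch that reproduces them from the figures on this page; the three-tier cutoff (within 5 percentage points of target counts as "usable but incomplete") is an assumption inferred from the numbers, not a documented rule:

```python
# Observed coverage vs target, taken from the checklist on this page.
coverage = {
    "explicit human feedback": (0.20, 0.45),
    "quality controls": (0.10, 0.30),
    "named benchmarks/datasets": (0.30, 0.35),
    "named evaluation metrics": (0.50, 0.35),
    "known rater population": (0.30, 0.35),
    "known annotation unit": (0.00, 0.35),
}

def label(observed: float, target: float) -> str:
    """Mirror the page's three-tier wording (cutoffs assumed)."""
    if observed >= target:
        return "strong"
    if target - observed <= 0.05:
        return "usable but incomplete"
    return "replication risk"

for field, (obs, tgt) in coverage.items():
    print(f"{field}: {obs:.0%} vs {tgt:.0%} target -> {label(obs, tgt)}")
```

With that assumed cutoff, the six fields land on exactly the labels the checklist reports.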


Suggested Reading Order

  1. KANO: Kolmogorov-Arnold Neural Operator

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

    Continues the detailed protocol reporting, with rater and quality-control evidence.

  3. ATTS: Asynchronous Test-Time Scaling via Conformal Prediction

    Continues the detailed protocol reporting, with rater and quality-control evidence.

  4. ClearFairy: Capturing Creative Workflows through Decision Structuring, In-Situ Questioning, and Rationale Inference

    Adds automatic metrics with critique/edit feedback for broader coverage within this hub.

  5. A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

    Adds automatic metrics with red-team protocols for broader coverage within this hub.

  6. The AI Memory Gap: Users Misremember What They Created With AI or Without

    Adds automatic metrics for broader coverage within this hub.

  7. Collaborative Document Editing with Multiple Users and AI Agents

    Adds simulation environments for broader coverage within this hub.

  8. PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Annotation unit is under-specified (0% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

automatic_metrics vs simulation_env

both=0, left_only=9, right_only=1: no paper in this cohort uses both Automatic Metrics and Simulation Env.
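The left/right/both split above is a set comparison over per-paper evaluation-mode tags. A minimal sketch of that computation; the paper IDs and tag assignments are illustrative placeholders, not the hub's real records:

```python
# Evaluation-mode tags per paper (invented IDs for illustration).
papers = {
    "paper-01": {"automatic_metrics"},
    "paper-02": {"simulation_env"},
    "paper-03": {"automatic_metrics"},
}

left = {p for p, modes in papers.items() if "automatic_metrics" in modes}
right = {p for p, modes in papers.items() if "simulation_env" in modes}

both = left & right          # papers tagged with both modes
left_only = left - right     # automatic metrics only
right_only = right - left    # simulation environments only
print(f"both={len(both)}, left_only={len(left_only)}, right_only={len(right_only)}")
```

Run over the real cohort, the same intersection/difference logic yields the both=0, left_only=9, right_only=1 split reported here.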

Benchmark Brief

AdvBench

Coverage: 1 paper (10%) mentions AdvBench.

Example: A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

AIME

Coverage: 1 paper (10%) mentions AIME.

Example: ATTS: Asynchronous Test-Time Scaling via Conformal Prediction

MATH

Coverage: 1 paper (10%) mentions MATH.

Example: Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Metric Brief

auroc

Coverage: 1 paper (10%) mentions auroc.

Example: MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

helpfulness

Coverage: 1 paper (10%) mentions helpfulness.

Example: A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Papers Published In This Period
