
HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W51

Updated from the current HFEPX corpus (Mar 1, 2026). 6 papers are grouped in this weekly page. Common evaluation modes: Automatic Metrics, LLM-as-Judge. Frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Dec 20, 2025.

Papers: 6 · Last published: Dec 20, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality during this period. Quality band: Developing.

High-Signal Coverage

100.0%

6 / 6 papers are not flagged as low-signal.

Benchmark Anchors

33.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

83.3%

Papers with reported metric mentions in extraction output.

  • 0 papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons (see the filtering sketch below).

Primary action: Use this slice as early signal only; benchmark/metric anchoring is too limited to support rigorous period-over-period claims.
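
The prioritization rule above is mechanical once the slice is available as structured records. The sketch below is a minimal Python illustration: the record layout and field names are assumptions rather than the HFEPX export schema, and the three entries are abbreviated from the Protocol Matrix further down this page.

```python
# Minimal triage sketch: keep papers that carry BOTH a benchmark anchor and a
# metric anchor, since only those support reliable longitudinal comparisons.
# The record layout is an illustrative assumption, not the HFEPX export schema.
papers = [
    {"title": "Towards Efficient Agents: A Co-Design of Inference Architecture and System",
     "benchmarks": ["BrowseComp"], "metrics": ["Accuracy", "Latency"]},
    {"title": "Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics",
     "benchmarks": ["JailbreakBench"], "metrics": []},   # metrics not reported
    {"title": "In-Context Algebra",
     "benchmarks": [], "metrics": ["Accuracy"]},         # benchmark not reported
]

def has_both_anchors(paper: dict) -> bool:
    """True when a paper names at least one benchmark and at least one metric."""
    return bool(paper["benchmarks"]) and bool(paper["metrics"])

priority = [p["title"] for p in papers if has_both_anchors(p)]
print("Prioritize for period-over-period comparison:", priority)
```

On this slice the filter keeps a single paper, which is consistent with the early-signal caveat above.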

Why This Time Slice Matters

  • 16.7% of papers report explicit human-feedback signals, led by red-team protocols.
  • Automatic Metrics appears as an evaluation mode in 83.3% of papers in this hub.
  • BrowseComp is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Stratify by benchmark (BrowseComp vs JailbreakBench) before comparing methods.
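
The stratification step can be sketched the same way: group papers by benchmark anchor first and compare methods only within a stratum, never across BrowseComp and JailbreakBench. Field names are illustrative assumptions; papers without a benchmark anchor fall into an "unanchored" bucket.

```python
from collections import defaultdict

# Stratify-by-benchmark sketch: comparisons happen only inside a stratum.
# Field names are illustrative, not the HFEPX export schema.
papers = [
    {"title": "Towards Efficient Agents: A Co-Design of Inference Architecture and System",
     "benchmark": "BrowseComp"},
    {"title": "Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics",
     "benchmark": "JailbreakBench"},
    {"title": "In-Context Algebra", "benchmark": None},  # no benchmark anchor
]

strata = defaultdict(list)
for paper in papers:
    strata[paper["benchmark"] or "unanchored"].append(paper["title"])

for benchmark, titles in sorted(strata.items()):
    # Any method comparison should stay inside this loop body (one stratum).
    print(f"{benchmark}: {len(titles)} paper(s)")
```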

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.
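
One way to approximate the "protocol completeness" part of that ordering is to count how many Protocol Matrix fields a paper actually reports. The sketch below is an assumption about how such a score could be computed, treating "Not reported" as missing; evidence density is not modeled, and the field names mirror the matrix columns rather than any published schema.

```python
# Rough protocol-completeness score: one point per Protocol Matrix field that
# is reported (i.e. not "Not reported"). Illustrative approximation only.
FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

def completeness(row: dict) -> int:
    return sum(1 for field in FIELDS if row.get(field, "Not reported") != "Not reported")

rows = [
    {"title": "Towards Efficient Agents: A Co-Design of Inference Architecture and System",
     "eval_modes": "Automatic Metrics", "benchmarks": "BrowseComp",
     "metrics": "Accuracy, Latency", "quality_controls": "Not reported"},
    {"title": "Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL",
     "eval_modes": "Automatic Metrics", "benchmarks": "Not reported",
     "metrics": "Cost", "quality_controls": "Not reported"},
]

for row in sorted(rows, key=completeness, reverse=True):
    print(f"{completeness(row)}/4  {row['title']}")
```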

Protocol Matrix (6 Papers)

Quickly compare method ingredients across this archive slice.

Paper | Date | Eval Modes | Benchmarks | Metrics | Quality Controls
Towards Efficient Agents: A Co-Design of Inference Architecture and System | Dec 20, 2025 | Automatic Metrics | BrowseComp | Accuracy, Latency | Not reported
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics | Dec 18, 2025 | LLM-as-Judge | JailbreakBench | Not reported | Not reported
Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL | Dec 18, 2025 | Automatic Metrics | Not reported | Cost | Not reported
In-Context Algebra | Dec 18, 2025 | Automatic Metrics | Not reported | Accuracy | Not reported
A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media | Dec 18, 2025 | Automatic Metrics | Not reported | Accuracy, Exact match | Not reported
Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent | Dec 17, 2025 | Automatic Metrics | Not reported | Success rate | Not reported
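
The headline coverage figures in the triage panel (33.3% benchmark anchors, 83.3% metric anchors, 0% quality controls) follow directly from these six rows. A small derivation sketch, with each row reduced to three boolean flags, is below.

```python
# Re-deriving the slice's coverage figures from the six Protocol Matrix rows.
# Each tuple: (benchmark reported, metrics reported, quality controls reported).
rows = [
    (True,  True,  False),  # Towards Efficient Agents
    (True,  False, False),  # Refusal Steering
    (False, True,  False),  # Knowledge Distillation (Text-to-SQL)
    (False, True,  False),  # In-Context Algebra
    (False, True,  False),  # Domain-Adapted Pipeline (police incident extraction)
    (False, True,  False),  # Imitation Game
]

def coverage(index: int) -> float:
    """Share of rows, in percent, where the given flag is True."""
    return 100.0 * sum(row[index] for row in rows) / len(rows)

print(f"Benchmark anchors: {coverage(0):.1f}%")  # 33.3%
print(f"Metric anchors:    {coverage(1):.1f}%")  # 83.3%
print(f"Quality controls:  {coverage(2):.1f}%")  # 0.0%
```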

Researcher Workflow

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (16.7% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Moderate: Papers naming benchmarks/datasets

    Coverage is usable but incomplete (33.3% vs 35% target).

  • Gap: Papers naming evaluation metrics

    Coverage is a replication risk (16.7% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (0% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (0% vs 35% target).
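
The gap/moderate labels above reduce to a coverage-versus-target comparison. The sketch below reproduces them from the checklist values; the exact banding rule is not published here, so the "within 5 points of target counts as moderate" threshold is an assumption that happens to match the labels in this slice.

```python
# Checklist banding sketch: observed coverage vs target, with an assumed
# 5-point tolerance for the "moderate" band. Values copied from the checklist.
CHECKS = {
    "explicit human feedback":   (16.7, 45.0),
    "quality controls reported": (0.0,  30.0),
    "benchmarks/datasets named": (33.3, 35.0),
    "evaluation metrics named":  (16.7, 35.0),
    "rater population known":    (0.0,  35.0),
    "annotation unit known":     (0.0,  35.0),
}

for name, (observed, target) in CHECKS.items():
    if observed >= target:
        band = "ok"
    elif target - observed <= 5.0:
        band = "moderate"
    else:
        band = "gap"  # flagged as a replication risk
    print(f"{band:8s} {name}: {observed:.1f}% vs {target:.0f}% target")
```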

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • No papers in this slice (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Stratify by benchmark (BrowseComp vs JailbreakBench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and latency.
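
For the metric-sensitivity suggestion, a minimal sketch is shown below: it checks whether a system ranking changes when latency is considered alongside accuracy. The two systems and their numbers are placeholders for illustration, not values reported by any paper in this slice.

```python
# Metric-sensitivity sketch: does the ranking flip when latency is reported
# alongside accuracy? Numbers below are placeholders, not paper results.
systems = {
    "system_a": {"accuracy": 0.71, "latency_s": 4.2},
    "system_b": {"accuracy": 0.69, "latency_s": 1.3},
}

by_accuracy = sorted(systems, key=lambda s: -systems[s]["accuracy"])
by_latency = sorted(systems, key=lambda s: systems[s]["latency_s"])

print("Accuracy ranking:", by_accuracy)
print("Latency ranking: ", by_latency)
if by_accuracy != by_latency:
    print("Rankings disagree: report both metrics, not accuracy alone.")
```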

Known Limitations
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot

Evaluation Modes

  • Automatic Metrics (5)
  • LLM-as-Judge (1)

Top Metrics

  • Accuracy (1)
  • Latency (1)
  • Throughput (1)

Top Benchmarks

  • BrowseComp (1)
  • JailbreakBench (1)

Quality Controls

  • None reported in this slice (0 of 6 papers).