HFEPX Archive Slice

HFEPX Daily Archive: 2025-12-18

Updated from current HFEPX corpus (Mar 10, 2026). 6 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, LLM-as-Judge. Frequently cited benchmark: JailbreakBench. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Dec 18, 2025.

Papers: 6 · Last published: Dec 18, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality this period. Quality band: Developing.

High-Signal Coverage

100.0%

6 / 6 papers are not low-signal flagged.

Benchmark Anchors

33.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

66.7%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Time Slice Matters

33.3% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 66.7% of papers in this hub.
JailbreakBench is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
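
One way to check judge-model calibration against a human-evaluated set is to measure chance-corrected agreement between judge labels and human labels. The sketch below computes Cohen's kappa in plain Python. The function name `cohens_kappa` and the `refuse`/`comply` labels are hypothetical and illustrative only; they are not taken from any paper in this slice.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement if both raters labeled independently at their own base rates.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for eight responses (assumed data, for illustration only).
judge = ["refuse", "refuse", "comply", "comply", "refuse", "comply", "comply", "refuse"]
human = ["refuse", "comply", "comply", "comply", "refuse", "comply", "refuse", "refuse"]
kappa = cohens_kappa(judge, human)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 suggests the judge tracks human labels closely, while a value near 0 suggests agreement no better than chance. What counts as an acceptable value depends on the task and is left open here.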

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Dec 18, 2025 · Citations: 0 · Score: 5.5
Eval: LLM-as-Judge · Metrics: Not reported

Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Dec 18, 2025 · Citations: 0 · Score: 5.5
Eval: Automatic Metrics · Metrics: Cost

Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Dec 18, 2025 · Citations: 0 · Score: 4.5
Eval: Automatic Metrics · Metrics: Cost

In-Context Algebra
Dec 18, 2025 · Citations: 0 · Score: 4.5
Eval: Automatic Metrics · Metrics: Accuracy

A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
Dec 18, 2025 · Citations: 0 · Score: 4.5
Eval: Automatic Metrics · Metrics: Accuracy, Exact match

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
Dec 18, 2025 · Citations: 0 · Score: 3.5
Eval: Not reported · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics	Dec 18, 2025	LLM-as-Judge	JailbreakBench	Not reported	Not reported
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills	Dec 18, 2025	Automatic Metrics	Not reported	Cost	Not reported
Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL	Dec 18, 2025	Automatic Metrics	Not reported	Cost	Not reported
In-Context Algebra	Dec 18, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media	Dec 18, 2025	Automatic Metrics	Not reported	Accuracy, Exact match	Not reported
Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation	Dec 18, 2025	Not reported	Top Bench	Not reported	Not reported
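
The anchor percentages reported for this slice can be recomputed from the matrix by counting, per column, the rows whose cell is not "Not reported". The sketch below is a minimal illustration under that assumption. The shortened paper titles, the `rows` layout, and the `coverage` helper are hypothetical and are not part of HFEPX.

```python
NOT_REPORTED = "Not reported"

# (short title, eval modes, benchmarks, metrics, quality controls),
# transcribed from the protocol matrix; titles are shortened for readability.
rows = [
    ("Refusal Steering", "LLM-as-Judge", "JailbreakBench", NOT_REPORTED, NOT_REPORTED),
    ("Adaptation of Agentic AI", "Automatic Metrics", NOT_REPORTED, "Cost", NOT_REPORTED),
    ("Knowledge Distillation for Text-to-SQL", "Automatic Metrics", NOT_REPORTED, "Cost", NOT_REPORTED),
    ("In-Context Algebra", "Automatic Metrics", NOT_REPORTED, "Accuracy", NOT_REPORTED),
    ("Police Incident Extraction Pipeline", "Automatic Metrics", NOT_REPORTED, "Accuracy, Exact match", NOT_REPORTED),
    ("Agent Tools Orchestration Leaks More", NOT_REPORTED, "Top Bench", NOT_REPORTED, NOT_REPORTED),
]


def coverage(rows, column):
    """Percentage of rows whose cell in `column` is reported, to one decimal place."""
    reported = sum(1 for row in rows if row[column] != NOT_REPORTED)
    return round(100.0 * reported / len(rows), 1)


benchmark_anchors = coverage(rows, 2)
metric_anchors = coverage(rows, 3)
quality_controls = coverage(rows, 4)
print(benchmark_anchors, metric_anchors, quality_controls)
```

Under this counting rule the benchmark and metric figures match the Benchmark Anchors and Metric Anchors cards, and the quality-control figure matches the zero papers noted above.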

Researcher Workflow (Detailed)

Checklist

Moderate: Papers with explicit human feedback
Coverage is usable but incomplete (33.3% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (16.7% vs 35% target).

Gap: Papers naming evaluation metrics
Coverage is a replication risk (16.7% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (0% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (0% vs 35% target).
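
The checklist labels compare each coverage figure with a target. The page does not state the rule that separates "Moderate" from "Gap", so the sketch below assumes one: coverage at or above half the target is "Moderate", anything lower is "Gap", and coverage at or above the target is "Met". The 0.5 cutoff, the "Met" label, and the `checklist_status` helper are assumptions for illustration, not the HFEPX rule.

```python
def checklist_status(coverage_pct, target_pct, moderate_ratio=0.5):
    """Classify coverage against a target. The cutoff is an assumption, not a documented rule."""
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "Met"  # hypothetical label; no checklist item on this page reaches its target
    if ratio >= moderate_ratio:
        return "Moderate"
    return "Gap"


# (item, coverage %, target %) as listed in the checklist.
checklist = [
    ("Papers with explicit human feedback", 33.3, 45.0),
    ("Papers reporting quality controls", 0.0, 30.0),
    ("Papers naming benchmarks/datasets", 16.7, 35.0),
    ("Papers naming evaluation metrics", 16.7, 35.0),
    ("Papers with known rater population", 0.0, 35.0),
    ("Papers with known annotation unit", 0.0, 35.0),
]

statuses = {name: checklist_status(cov, target) for name, cov, target in checklist}
for name, status in statuses.items():
    print(f"{status}: {name}")
```

With the assumed cutoff, the six items receive the same labels as the checklist shows. Other cutoffs would also fit these six data points, so treat the rule as a guess.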

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

No papers report quality controls (0% coverage); prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Annotation unit is under-specified (0% coverage).

Recommended Queries

LLM-as-Judge Protocols
Benchmark Slice: JailbreakBench
Metric Slice: cost
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (4)
LLM-as-Judge (1)

Top Metrics

Cost (1)

Top Benchmarks

JailbreakBench (1)

Quality Controls

None reported.

Papers In This Archive Slice

Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Khushboo Thaker, Yony Bresler · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

In-Context Algebra
Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus · Dec 18, 2025 · Citations: 0
Red Team
We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal–compliance direction.

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu · Dec 18, 2025 · Citations: 0
Evaluation of six state-of-the-art LLMs reveals pervasive risk: the average Overall Leakage Rate reaches 62.11% with an H-Score of only 52.90%.

Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He · Dec 18, 2025 · Citations: 0
Pairwise Preference · Tool Use
Large language model (LLM) agents are moving beyond prompting alone.

A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
Mengfan Shen, Kangqi Song, Xindi Wang, Wei Jia, Tao Wang · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
