
HFEPX Archive Slice

HFEPX Daily Archive: 2025-11-18


Updated from the current HFEPX corpus (Apr 9, 2026). This daily page groups 9 papers. Most common evaluation mode: Automatic Metrics; most common annotation unit: ranking; most frequent quality control: adjudication; most frequently cited benchmark: FinAgentBench; most common metric signal: accuracy. Use this page to compare protocol setups, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Nov 18, 2025.

Papers: 9 · Last published: Nov 18, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality this period. Quality band: Medium.

High-Signal Coverage

100.0%

9 / 9 papers are free of low-signal flags.

Benchmark Anchors

33.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

66.7%

Papers with reported metric mentions in extraction output.

  • 1 paper reports explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as an early signal only; benchmark/metric anchoring is too limited for rigorous period-over-period claims.
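The anchor percentages above can be recomputed directly from extraction records when monitoring future slices. A minimal sketch follows; the record fields (`benchmarks`, `metrics`, `low_signal`) are hypothetical stand-ins, not the actual HFEPX schema.

```python
# Hypothetical paper records; field names and values are illustrative only.
papers = [
    {"title": "paper-1", "benchmarks": ["MMLU"], "metrics": ["Brier score"], "low_signal": False},
    {"title": "paper-2", "benchmarks": [], "metrics": ["Accuracy"], "low_signal": False},
    {"title": "paper-3", "benchmarks": [], "metrics": [], "low_signal": False},
]

def coverage(papers, predicate):
    """Share of papers satisfying a predicate, as a percentage."""
    return 100.0 * sum(predicate(p) for p in papers) / len(papers)

print(f"high-signal coverage: {coverage(papers, lambda p: not p['low_signal']):.1f}%")
print(f"benchmark anchors:    {coverage(papers, lambda p: bool(p['benchmarks'])):.1f}%")
print(f"metric anchors:       {coverage(papers, lambda p: bool(p['metrics'])):.1f}%")
```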


Why This Time Slice Matters

  • Automatic-metrics evaluation appears in 66.7% of papers in this hub.
  • FinAgentBench is a recurring benchmark anchor for cross-paper comparisons on this page.
  • Multi-agent setups appear in 33.3% of papers, indicating demand for agentic evaluation.

Protocol Takeaways For This Period

  • The most common quality-control signal is adjudication (11.1% of papers).
  • Rater pools are mostly unspecified, and ranking is the most common annotation unit; use this to scope replication staffing.
  • Stratify by benchmark (FinAgentBench vs FinanceBench) before comparing methods, as in the sketch below.
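Stratification here just means grouping results by benchmark before any method-level comparison, so scores from FinAgentBench and FinanceBench are never pooled. A minimal sketch; the method names and scores below are placeholders, not results from any paper in this slice.

```python
from collections import defaultdict
from statistics import mean

# Placeholder result records; methods and scores are invented for illustration.
results = [
    {"method": "method_a", "benchmark": "FinAgentBench", "ndcg": 0.71},
    {"method": "method_b", "benchmark": "FinAgentBench", "ndcg": 0.64},
    {"method": "method_a", "benchmark": "FinanceBench", "ndcg": 0.58},
    {"method": "method_b", "benchmark": "FinanceBench", "ndcg": 0.61},
]

# Group scores by (benchmark, method) so methods are compared only
# within the same benchmark stratum, never across pooled benchmarks.
strata = defaultdict(list)
for r in results:
    strata[(r["benchmark"], r["method"])].append(r["ndcg"])

for (benchmark, method), scores in sorted(strata.items()):
    print(f"{benchmark:14} {method:10} mean nDCG = {mean(scores):.3f}")
```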

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review; the matrix below follows this ranking.

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

  • Let the Model Distribute Its Doubt: Confidence Estimation through Verbalized Probability Distribution (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: MMLU, MMLU-Pro · Metrics: Brier score · Quality controls: Not reported
  • PRISM: Prompt-Refined In-Context System Modelling for Financial Retrieval (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: FinAgentBench, FinanceBench · Metrics: nDCG, Latency · Quality controls: Not reported
  • Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: AdvBench · Metrics: Cost, Jailbreak success rate · Quality controls: Not reported
  • From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Accuracy · Quality controls: Adjudication
  • SVBRD-LLM: Self-Verifying Behavioral Rule Discovery for Autonomous Vehicle Identification (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Accuracy, F1 · Quality controls: Not reported
  • Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Accuracy, F1 · Quality controls: Not reported
  • Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving (Nov 18, 2025)
    Eval modes: Not reported · Benchmarks: Not reported · Metrics: Not reported · Quality controls: Not reported
  • AISAC: An Integrated multi-agent System for Transparent, Retrieval-Grounded Scientific Assistance (Nov 18, 2025)
    Eval modes: Not reported · Benchmarks: Not reported · Metrics: Not reported · Quality controls: Not reported
  • FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration (Nov 18, 2025)
    Eval modes: Not reported · Benchmarks: Not reported · Metrics: Not reported · Quality controls: Not reported
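For reference on one of the metrics named above: the Brier score used by the confidence-estimation paper is, in its standard multi-class form, the mean squared error between a predicted probability distribution and the one-hot true label. A minimal sketch, not the paper's own implementation:

```python
def brier_score(prob_dists, labels):
    """Mean multi-class Brier score: average squared error between each
    predicted distribution and the one-hot encoding of the true class.
    Lower is better; 0.0 means perfect, fully confident predictions."""
    total = 0.0
    for probs, label in zip(prob_dists, labels):
        total += sum((p - (1.0 if c == label else 0.0)) ** 2
                     for c, p in enumerate(probs))
    return total / len(labels)

# Toy 3-class example (values are illustrative only).
print(brier_score([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]], [0, 2]))  # 0.44
```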
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (0% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (11.1% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (11.1% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (22.2% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (0% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (11.1% vs 35% target).

Strengths

  • Agentic evaluation appears in 33.3% of papers.

Known Gaps

  • Only 11.1% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Annotation unit is under-specified (11.1% coverage).

Suggested Next Analyses

  • Stratify by benchmark (FinAgentBench vs FinanceBench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and cost.
  • Add inter-annotator agreement checks when reproducing these protocols; a minimal agreement sketch follows this list.
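Cohen's kappa is a standard inter-annotator agreement check for two raters labeling the same items: observed agreement corrected for the agreement expected by chance. A minimal sketch with invented ranking labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # degenerate case: every label identical
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Toy pairwise-ranking labels from two hypothetical raters.
rater_1 = ["A>B", "B>A", "A>B", "A>B", "B>A"]
rater_2 = ["A>B", "A>B", "A>B", "B>A", "B>A"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")  # 0.17
```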


Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (6)

Top Metrics

  • Accuracy (3)
  • F1 (2)
  • Brier score (1)
  • Cost (1)
  • Jailbreak success rate (1)
  • Latency (1)
  • nDCG (1)

Top Benchmarks

  • AdvBench (1)
  • FinAgentBench (1)
  • FinanceBench (1)
  • MMLU (1)
  • MMLU-Pro (1)

Quality Controls

  • Adjudication (1); see the sketch below.
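Adjudication, the one quality-control signal reported in this slice, typically means two raters label independently and a third, senior rater resolves disagreements. A minimal sketch of that resolution step (a generic pattern, not any paper's specific protocol):

```python
def adjudicate(label_a, label_b, adjudicator_label):
    """Two-rater adjudication: keep agreed labels, defer
    disagreements to a third, senior adjudicator."""
    return label_a if label_a == label_b else adjudicator_label

# Toy case: the raters disagree, so the adjudicator breaks the tie.
print(adjudicate("A>B", "B>A", "A>B"))  # -> A>B
```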
