
HFEPX Archive Slice

HFEPX Daily Archive: 2025-11-18


Updated from the current HFEPX corpus (Apr 9, 2026). This daily page groups 9 papers. Most common evaluation mode: Automatic Metrics; most common annotation unit: ranking; most frequent quality control: adjudication; most frequently cited benchmark: FinAgentBench; most common metric signal: accuracy. Use this page to compare protocol setups, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Nov 18, 2025.

Papers: 9 · Last published: Nov 18, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality this period. Quality band: Medium.

High-Signal Coverage

100.0%

9 / 9 papers are free of low-signal flags.

Benchmark Anchors

33.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

66.7%

Papers with reported metric mentions in extraction output.

  • 1 paper reports explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as an early signal only; benchmark/metric anchoring is too limited for rigorous period-over-period claims.
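The anchor percentages above can be recomputed directly from extraction records when monitoring future slices. A minimal sketch follows; the record fields (`benchmarks`, `metrics`, `low_signal`) are hypothetical stand-ins, not the actual HFEPX schema.

```python
# Hypothetical paper records; field names and values are illustrative only.
papers = [
    {"title": "paper-1", "benchmarks": ["MMLU"], "metrics": ["Brier score"], "low_signal": False},
    {"title": "paper-2", "benchmarks": [], "metrics": ["Accuracy"], "low_signal": False},
    {"title": "paper-3", "benchmarks": [], "metrics": [], "low_signal": False},
]

def coverage(papers, predicate):
    """Share of papers satisfying a predicate, as a percentage."""
    return 100.0 * sum(predicate(p) for p in papers) / len(papers)

print(f"high-signal coverage: {coverage(papers, lambda p: not p['low_signal']):.1f}%")
print(f"benchmark anchors:    {coverage(papers, lambda p: bool(p['benchmarks'])):.1f}%")
print(f"metric anchors:       {coverage(papers, lambda p: bool(p['metrics'])):.1f}%")
```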


Why This Time Slice Matters

  • Automatic-metrics evaluation appears in 66.7% of papers in this hub.
  • FinAgentBench is a recurring benchmark anchor for cross-paper comparisons on this page.
  • Multi-agent setups appear in 33.3% of papers, indicating demand for agentic evaluation.

Protocol Takeaways For This Period

  • The most common quality-control signal is adjudication (11.1% of papers).
  • Rater pools are mostly unspecified, and ranking is the most common annotation unit; use this to scope replication staffing.
  • Stratify by benchmark (FinAgentBench vs FinanceBench) before comparing methods, as in the sketch below.
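Stratification here just means grouping results by benchmark before any method-level comparison, so scores from FinAgentBench and FinanceBench are never pooled. A minimal sketch; the method names and scores below are placeholders, not results from any paper in this slice.

```python
from collections import defaultdict
from statistics import mean

# Placeholder result records; methods and scores are invented for illustration.
results = [
    {"method": "method_a", "benchmark": "FinAgentBench", "ndcg": 0.71},
    {"method": "method_b", "benchmark": "FinAgentBench", "ndcg": 0.64},
    {"method": "method_a", "benchmark": "FinanceBench", "ndcg": 0.58},
    {"method": "method_b", "benchmark": "FinanceBench", "ndcg": 0.61},
]

# Group scores by (benchmark, method) so methods are compared only
# within the same benchmark stratum, never across pooled benchmarks.
strata = defaultdict(list)
for r in results:
    strata[(r["benchmark"], r["method"])].append(r["ndcg"])

for (benchmark, method), scores in sorted(strata.items()):
    print(f"{benchmark:14} {method:10} mean nDCG = {mean(scores):.3f}")
```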

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review; the matrix below follows this ranking.

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

  • Let the Model Distribute Its Doubt: Confidence Estimation through Verbalized Probability Distribution (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: MMLU, MMLU-Pro · Metrics: Brier score · Quality controls: Not reported
  • PRISM: Prompt-Refined In-Context System Modelling for Financial Retrieval (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: FinAgentBench, FinanceBench · Metrics: nDCG, Latency · Quality controls: Not reported
  • Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: AdvBench · Metrics: Cost, Jailbreak success rate · Quality controls: Not reported
  • From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Accuracy · Quality controls: Adjudication
  • SVBRD-LLM: Self-Verifying Behavioral Rule Discovery for Autonomous Vehicle Identification (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Accuracy, F1 · Quality controls: Not reported
  • Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement (Nov 18, 2025)
    Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Accuracy, F1 · Quality controls: Not reported
  • Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving (Nov 18, 2025)
    Eval modes: Not reported · Benchmarks: Not reported · Metrics: Not reported · Quality controls: Not reported
  • AISAC: An Integrated multi-agent System for Transparent, Retrieval-Grounded Scientific Assistance (Nov 18, 2025)
    Eval modes: Not reported · Benchmarks: Not reported · Metrics: Not reported · Quality controls: Not reported
  • FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration (Nov 18, 2025)
    Eval modes: Not reported · Benchmarks: Not reported · Metrics: Not reported · Quality controls: Not reported
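For reference on one of the metrics named above: the Brier score used by the confidence-estimation paper is, in its standard multi-class form, the mean squared error between a predicted probability distribution and the one-hot true label. A minimal sketch, not the paper's own implementation:

```python
def brier_score(prob_dists, labels):
    """Mean multi-class Brier score: average squared error between each
    predicted distribution and the one-hot encoding of the true class.
    Lower is better; 0.0 means perfect, fully confident predictions."""
    total = 0.0
    for probs, label in zip(prob_dists, labels):
        total += sum((p - (1.0 if c == label else 0.0)) ** 2
                     for c, p in enumerate(probs))
    return total / len(labels)

# Toy 3-class example (values are illustrative only).
print(brier_score([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]], [0, 2]))  # 0.44
```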
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (0% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (11.1% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (11.1% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (22.2% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (0% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (11.1% vs 35% target).

Strengths

  • Agentic evaluation appears in 33.3% of papers.

Known Gaps

  • Only 11.1% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Annotation unit is under-specified (11.1% coverage).

Suggested Next Analyses

  • Stratify by benchmark (FinAgentBench vs FinanceBench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and cost.
  • Add inter-annotator agreement checks when reproducing these protocols; a minimal agreement sketch follows this list.
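Cohen's kappa is a standard inter-annotator agreement check for two raters labeling the same items: observed agreement corrected for the agreement expected by chance. A minimal sketch with invented ranking labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # degenerate case: every label identical
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Toy pairwise-ranking labels from two hypothetical raters.
rater_1 = ["A>B", "B>A", "A>B", "A>B", "B>A"]
rater_2 = ["A>B", "A>B", "A>B", "B>A", "B>A"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")  # 0.17
```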


Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (6)

Top Metrics

  • Accuracy (3)
  • F1 (2)
  • Brier score (1)
  • Cost (1)
  • Jailbreak success rate (1)
  • Latency (1)
  • nDCG (1)

Top Benchmarks

  • AdvBench (1)
  • FinAgentBench (1)
  • FinanceBench (1)
  • MMLU (1)
  • MMLU-Pro (1)

Quality Controls

  • Adjudication (1); see the sketch below.
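Adjudication, the one quality-control signal reported in this slice, typically means two raters label independently and a third, senior rater resolves disagreements. A minimal sketch of that resolution step (a generic pattern, not any paper's specific protocol):

```python
def adjudicate(label_a, label_b, adjudicator_label):
    """Two-rater adjudication: keep agreed labels, defer
    disagreements to a third, senior adjudicator."""
    return label_a if label_a == label_b else adjudicator_label

# Toy case: the raters disagree, so the adjudicator breaks the tie.
print(adjudicate("A>B", "B>A", "A>B"))  # -> A>B
```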
