HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W40

Updated from current HFEPX corpus (Mar 1, 2026). 19 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 1, 2026). 19 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Ranking. Common metric signal: precision. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Oct 5, 2025.

Papers: 19 Last published: Oct 5, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

19 / 19 papers are not low-signal flagged.

Benchmark Anchors

10.5%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

42.1%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Slice Matters (Expanded)

Why This Time Slice Matters

10.5% of papers report explicit human-feedback signals, led by demonstration data.
automatic metrics appears in 31.6% of papers in this hub.
precision is a repeated reporting metric here, enabling more consistent cross-paper score interpretation.

Protocol Notes (Expanded)

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
Oct 5, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Spearman
Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
Oct 4, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Accuracy, Pass@k
Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval
Oct 3, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Ndcg
Generative Value Conflicts Reveal LLM Priorities
Sep 29, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Harmlessness
Incentive-Aligned Multi-Source LLM Summaries
Sep 29, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Accuracy
TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models
Sep 29, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity Oct 5, 2025	Automatic Metrics	Not reported	Spearman	Not reported
Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning Oct 4, 2025	Automatic Metrics	Not reported	Accuracy, Pass@k	Not reported
Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval Oct 3, 2025	Automatic Metrics	Not reported	Ndcg	Not reported
Generative Value Conflicts Reveal LLM Priorities Sep 29, 2025	Automatic Metrics	Not reported	Harmlessness	Not reported
Incentive-Aligned Multi-Source LLM Summaries Sep 29, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models Sep 29, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
On Discovering Algorithms for Adversarial Imitation Learning Oct 1, 2025	Simulation Env	Not reported	Not reported	Not reported
SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations Oct 5, 2025	Not reported	Not reported	Coherence	Not reported
BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals Oct 2, 2025	Not reported	Not reported	Cost	Not reported
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses Sep 30, 2025	Not reported	Biasfreebench	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (10.5% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (0% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (10.5% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (10.5% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (26.3% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.5% coverage).
Benchmark coverage is thin (0% of papers mention benchmarks/datasets).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Track metric sensitivity by reporting both precision and spearman.

Recommended Queries

LLM-as-Judge Protocols Metric Slice: precision Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (6)
Llm As Judge (1)
Simulation Env (1)

Top Metrics

Precision (1)
Spearman (1)

Top Benchmarks

Quality Controls

Papers In This Archive Slice

SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations
Buyun Liang, Liangzu Peng, Jinqi Luo, Darshan Thaker, Kwan Ho Ryan Chan · Oct 5, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun · Oct 5, 2025 · Citations: 0

Pairwise Preference

On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture.
Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland · Oct 4, 2025 · Citations: 0

We validate the efficacy of this algorithm on diverse math reasoning benchmarks.
AgentHub: A Registry for Discoverable, Verifiable, and Reproducible AI Agents
Erik Pautsch, Tanmay Singla, Parv Kumar, Wenxin Jiang, Huiyun Peng · Oct 3, 2025 · Citations: 0

LLM-based agents are rapidly proliferating, yet the infrastructure for discovering, evaluating, and governing them remains fragmented compared to mature ecosystems like software package registries (e.g., npm) and model hubs (e.g., Hugging…
Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval
Yohan Lee, Yongwoo Song, Sangyeop Kim · Oct 3, 2025 · Citations: 0

We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights.
BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
Chenqi Li, Yu Liu, Timothy Denison, Tingting Zhu · Oct 2, 2025 · Citations: 0

Biosignals offer valuable insights into the physiological states of the human body.
Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs
Shenzhe Zhu, Shu Yang, Michiel A. Bakker, Alex Pentland, Jiaxin Pei · Oct 2, 2025 · Citations: 0

To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions created by 3,000 participants and (2) summary judgment data annotated by 4,500 participants across…
On Discovering Algorithms for Adversarial Imitation Learning
Shashank Reddy Chirra, Jayden Teoh, Praveen Paruchuri, Pradeep Varakantham · Oct 1, 2025 · Citations: 0

Demonstrations

RA functions in AIL are typically derived from divergence minimization objectives, relying heavily on human design and ingenuity.
Hearing the Order: Investigating Position Bias in Large Audio-Language Models
Yu-Xiang Lin, Chen-An Li, Sheng-Lun Wei, Po-Chun Chen, Hsin-Hsi Chen · Oct 1, 2025 · Citations: 0

We demonstrate that no model is immune to this bias through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts.
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses
Xin Xu, Xunzhi He, Churan Zhi, Ruizhe Chen, Julian McAuley · Sep 30, 2025 · Citations: 0

Moreover, their evaluations are mostly based on the comparison between LLMs' probabilities of biased and unbiased contexts, which ignores the gap between such evaluations and real-world use cases where users interact with LLMs by reading…
Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts
Hanwen Du, Yuxin Dong, Xia Ning · Sep 30, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts
Yuan Zhuang, Yi Shen, Yuexin Bian, Qing Su, Shihao Ji · Sep 30, 2025 · Citations: 0

Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared to state-of-the-art baselines, across a diverse set of benchmarks.
Polychromic Objectives for Reinforcement Learning
Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh · Sep 29, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs
Shane Bergsma, Nolan Dey, Joel Hestness · Sep 29, 2025 · Citations: 0

We introduce the *training re-evaluation curve (TREC)*, a diagnostic that retrospectively evaluates training batches *using the final model weights*.
Generative Value Conflicts Reveal LLM Priorities
Andy Liu, Kshitish Ghate, Mona Diab, Daniel Fried, Atoosa Kasirzadeh · Sep 29, 2025 · Citations: 0

Comparing results between multiple-choice and open-ended evaluations, we find that models shift away from supporting protective values, such as harmlessness, and toward supporting personal values, such as user autonomy, in more open-ended…
Incentive-Aligned Multi-Source LLM Summaries
Yanchen Jiang, Zhe Feng, Aranyak Mehta · Sep 29, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models
Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang · Sep 29, 2025 · Citations: 0

TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs.
Inducing Dyslexia in Vision Language Models
Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf · Sep 29, 2025 · Citations: 0

Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that they predict human VWFA neural responses.
G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge
Linhao Luo, Zicheng Zhao, Junnan Liu, Zhangchi Qiu, Junnan Dong · Sep 29, 2025 · Citations: 0

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote