HFEPX Archive Slice

HFEPX Fortnight Archive: 2025-F05

Updated from current HFEPX corpus (Apr 18, 2026). 39 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 18, 2026). 39 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: Bff-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 9, 2025.

Papers: 39 Last published: Mar 9, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

39 / 39 papers are not low-signal flagged.

Benchmark Anchors

15.4%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

41.0%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

10.3% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 30.8% of papers in this hub.
Bff-Bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (2.6% of papers).
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
Mar 7, 2025 · Citations: 0 · Score: 5.4

Eval: Llm As Judge · Metrics: Agreement, Cost
PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Feb 25, 2025 · Citations: 0 · Score: 4.4

Eval: Automatic Metrics · Metrics: Relevance
Compressing Language Models for Specialized Domains
Feb 25, 2025 · Citations: 0 · Score: 3.9

Eval: Automatic Metrics · Metrics: Cost
Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems
Mar 6, 2025 · Citations: 0 · Score: 2.9

Eval: Automatic Metrics · Metrics: Accuracy
VQEL: Enabling Self-Play in Emergent Language Games via Agent-Internal Vector Quantization
Mar 6, 2025 · Citations: 0 · Score: 2.9

Eval: Automatic Metrics · Metrics: Task success
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Mar 3, 2025 · Citations: 0 · Score: 2.9

Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding Mar 7, 2025	Llm As Judge	MT Bench, Bff Bench	Agreement, Cost	Not reported
PII-Bench: Evaluating Query-Aware Privacy Protection Systems Feb 25, 2025	Automatic Metrics	Pii Bench	Relevance	Not reported
Compressing Language Models for Specialized Domains Feb 25, 2025	Automatic Metrics	Not reported	Cost	Calibration
Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems Mar 6, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
VQEL: Enabling Self-Play in Emergent Language Games via Agent-Internal Vector Quantization Mar 6, 2025	Automatic Metrics	Not reported	Task success	Not reported
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs Mar 3, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
LLM-Advisor: An LLM Benchmark for Cost-efficient Path Planning across Multiple Terrains Mar 3, 2025	Automatic Metrics	Not reported	Cost	Not reported
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks Feb 28, 2025	Not reported	Not reported	Helpfulness	Not reported
Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository Feb 28, 2025	Automatic Metrics	Not reported	Rmse	Not reported
HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture Feb 27, 2025	Automatic Metrics	Not reported	Accuracy, Cost	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (10.3% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (2.6% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (2.6% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (20.5% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (15.4% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (5.1% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 2.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15.4% coverage).
Annotation unit is under-specified (5.1% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (Bff-Bench vs MT-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: Bff-Bench Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 2.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15.4% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (12)
Llm As Judge (2)
Simulation Env (1)

Top Metrics

Accuracy (5)
Cost (2)
Agreement (1)
Helpfulness (1)

Top Benchmarks

Bff Bench (1)
MT Bench (1)

Quality Controls

Calibration (1)

Papers In This Archive Slice

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye · Mar 9, 2025 · Citations: 0
Green Prompting: Characterizing Prompt-driven Energy Costs of LLM Inference
Marta Adamska, Daria Smirnova, Hamid Nasiri, Zhengxin Yu, Peter Garraghan · Mar 9, 2025 · Citations: 0

Web Browsing

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang · Mar 9, 2025 · Citations: 0

Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-11% improvements across MATH500, AIME24, and GPQA_diamond benchmarks.
Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke · Mar 7, 2025 · Citations: 0

When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, and show improvements over prompting and Self-Debias in all cases, and…
Frequency Autoregressive Image Generation with Continuous Tokens
Hu Yu, Hao Luo, Hangjie Yuan, Yu Rong, Jie Huang · Mar 7, 2025 · Citations: 0

However, due to the huge modality gap, image autoregressive models may require a systematic reevaluation from two perspectives: tokenizer format and regression direction.
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner · Mar 7, 2025 · Citations: 0

Pairwise Preference

To address this gap, we introduce the Business and Finance Fundamentals Benchmark (BFF-Bench), a dataset of 160 challenging questions and long-form responses authored by financial professionals.
Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems
Jooyoung Lee, Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos · Mar 6, 2025 · Citations: 0

The proliferation of generative models has presented significant challenges in distinguishing authentic human-authored content from deepfake content.
VQEL: Enabling Self-Play in Emergent Language Games via Agent-Internal Vector Quantization
Mohammad Mahdi Samiei Paqaleh, Mehdi Jamalkhah, Mahdieh Soleymani Baghshah · Mar 6, 2025 · Citations: 0

Emergent Language (EL) focuses on the emergence of communication among artificial agents.
Training-free Adjustable Polynomial Graph Filtering for Ultra-fast Multimodal Recommendation
Yu-Seung Roh, Joo-Young Kim, Jin-Duk Park, Won-Yong Shin · Mar 6, 2025 · Citations: 0
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng · Mar 6, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes · Mar 5, 2025 · Citations: 0
LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
Jude Khouja, Lingyi Yang, Karolina Korgul, Simeon Hellsten, Vlad A. Neacsu · Mar 4, 2025 · Citations: 0

We introduce LINGOLY-TOO, a challenging reasoning benchmark of 1,203 questions and a total of 6,995 sub-questions that counters these shortcuts by applying expert-designed obfuscations to Linguistics Olympiad problems.
Wikipedia in the Era of LLMs: Evolution and Risks
Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen · Mar 4, 2025 · Citations: 0

If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models could shift.
Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models
David Bani-Harouni, Chantal Pellegrini, Paul Stangel, Ege Özsoy, Kamilia Zaripova · Mar 4, 2025 · Citations: 0
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen · Mar 3, 2025 · Citations: 0

A response mixed of factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on.
$\texttt{SEM-CTRL}$: Semantically Controlled Decoding
Mohammad Albinhassan, Pranava Madhyastha, Alessandra Russo · Mar 3, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
LLM-Advisor: An LLM Benchmark for Cost-efficient Path Planning across Multiple Terrains
Ling Xiao, Toshihiko Yamasaki · Mar 3, 2025 · Citations: 0

Web Browsing

We further introduce two datasets, MultiTerraPath and RUGD_v2, for systematic evaluation of cost-efficient path planning.
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Hanjiang Hu, Alexander Robey, Changliu Liu · Feb 28, 2025 · Citations: 0

Red Team

To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues.
Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models
Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf · Feb 28, 2025 · Citations: 0
Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository
Radhika Kapoor, Sang T. Truong, Nick Haber, Maria Araceli Ruiz-Primo, Benjamin W. Domingue · Feb 28, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture
Taiqiang Wu, Chenchen Ding, Wenyong Zhou, Yuxin Cheng, Xincheng Feng · Feb 27, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Stay Focused: Problem Drift in Multi-Agent Debate
Jonas Becker, Lars Benedikt Kaesberg, Andreas Stephan, Jan Philip Wahle, Terry Ruas · Feb 26, 2025 · Citations: 0

Multi Agent

Multi-agent debate - multiple instances of large language models discussing problems in turn-based interaction - has shown promise for solving knowledge and reasoning tasks.
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz · Feb 26, 2025 · Citations: 0

To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks.
Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning
Hongyi Cai, Jie Li, Mohammad Mahdinur Rahman, Wenzhen Dong · Feb 26, 2025 · Citations: 0

Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive…
Transforming the Voice of the Customer: Large Language Models for Identifying Customer Needs
Artem Timoshenko, Chengfeng Mao, John R. Hauser · Feb 25, 2025 · Citations: 0

While current practice uses machine learning to screen content, the critical final step of precisely formulating CNs relies on expert human judgment.
Compressing Language Models for Specialized Domains
Miles Williams, George Chrysostomou, Vitor Jeronymo, Nikolaos Aletras · Feb 25, 2025 · Citations: 0

Compression techniques such as pruning and quantization offer a practical path towards efficient LM deployment, exemplified by their ability to preserve performance on general-purpose benchmarks.
Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization
Ru Wang, Wei Huang, Selena Song, Haoyu Zhang, Qian Niu · Feb 25, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han · Feb 25, 2025 · Citations: 0

To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.
Connecting Voices: LoReSpeech as a Low-Resource Speech Parallel Corpus
Samy Ouzerrout · Feb 25, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Can Multimodal LLMs Perform Time Series Anomaly Detection?
Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao · Feb 25, 2025 · Citations: 0

Multi Agent

One natural way for humans to detect time series anomalies is through visualization and textual description.
Renormalization-Inspired Effective Field Neural Networks for Scalable Modeling of Classical and Quantum Many-Body Systems
Xi Liu, Yujun Zhao, Chun Yu Wan, Yang Zhang, Junwei Liu · Feb 24, 2025 · Citations: 0
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang · Feb 24, 2025 · Citations: 0

As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this…
Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 24, 2025 · Citations: 0

Natural Language Processing (NLP) is becoming a dominant subset of artificial intelligence as the need to help machines understand human language looks indispensable.
HIPPO: Enhancing the Table Understanding Capability of LLMs through Hybrid-Modal Preference Optimization
Haolan Wang, Zhenghao Liu, Xinze Li, Xiaocui Yang, Yu Gu · Feb 24, 2025 · Citations: 0

Pairwise Preference

To better capture structural semantics from the tabular data, this paper introduces the HybrId-modal Preference oPtimizatiOn (HIPPO) model, which represents tables using both text and image, optimizing MLLMs by learning more comprehensive…
Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective
Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li · Feb 24, 2025 · Citations: 0

Predictable subset performance acts as an intermediate predictor for the full evaluation set.
From Euler to AI: Unifying Formulas for Mathematical Constants
Tomer Raz, Michael Shalyt, Elyasheev Leibtag, Rotem Kalisch, Shachar Weinbaum · Feb 24, 2025 · Citations: 0
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen · Feb 24, 2025 · Citations: 0

Pairwise Preference

Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval.
Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment
Chenghao Fan, Zhenyi Lu, Sichen Liu, Chengfeng Gu, Xiaoye Qu · Feb 24, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions
Joseph Suh, Erfan Jahanparast, Suhong Moon, Minwoo Kang, Serina Chang · Feb 24, 2025 · Citations: 0

Prior methods steer LLMs via descriptions of subpopulations as LLMs' input prompt, yet such prompt engineering approaches have struggled to faithfully predict the distribution of survey responses from human subjects.

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now