HFEPX Archive Slice

HFEPX Fortnight Archive: 2025-F05

Updated from current HFEPX corpus (Mar 1, 2026). 12 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 1, 2026). 12 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Common metric signal: helpfulness. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 9, 2025.

Papers: 12 Last published: Mar 9, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

12 / 12 papers are not low-signal flagged.

Benchmark Anchors

16.7%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

50.0%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Slice Matters (Expanded)

Why This Time Slice Matters

25% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 50% of papers in this hub.
multi-agent setups appears in 8.3% of papers, indicating agentic evaluation demand.

Protocol Notes (Expanded)

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (8.3% of papers).
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Add inter-annotator agreement checks when reproducing these protocols.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Feb 25, 2025 · Citations: 0 · Score: 4.4

Eval: Automatic Metrics · Metrics: Relevance
Compressing Language Models for Specialized Domains
Feb 25, 2025 · Citations: 0 · Score: 3.9

Eval: Automatic Metrics · Metrics: Cost
VQEL: Enabling Self-Play in Emergent Language Games via Agent-Internal Vector Quantization
Mar 6, 2025 · Citations: 0 · Score: 3.5

Eval: Automatic Metrics · Metrics: Task success
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Mar 3, 2025 · Citations: 0 · Score: 3.5

Eval: Automatic Metrics · Metrics: Accuracy
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Feb 28, 2025 · Citations: 0 · Score: 2.9

Eval: Not reported · Metrics: Helpfulness
InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Mar 9, 2025 · Citations: 0 · Score: 2.5

Eval: Not reported · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
PII-Bench: Evaluating Query-Aware Privacy Protection Systems Feb 25, 2025	Automatic Metrics	Pii Bench	Relevance	Not reported
Compressing Language Models for Specialized Domains Feb 25, 2025	Automatic Metrics	Not reported	Cost	Calibration
VQEL: Enabling Self-Play in Emergent Language Games via Agent-Internal Vector Quantization Mar 6, 2025	Automatic Metrics	Not reported	Task success	Not reported
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs Mar 3, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks Feb 28, 2025	Not reported	Not reported	Helpfulness	Not reported
InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models Mar 9, 2025	Not reported	MATH 500, GPQA	Not reported	Not reported
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling Mar 6, 2025	Not reported	Not reported	Throughput	Not reported
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence Feb 24, 2025	Automatic Metrics	Not reported	Not reported	Not reported
Can Multimodal LLMs Perform Time Series Anomaly Detection? Feb 25, 2025	Automatic Metrics	Not reported	Not reported	Not reported
HIPPO: Enhancing the Table Understanding Capability of LLMs through Hybrid-Modal Preference Optimization Feb 24, 2025	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (25% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (8.3% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (0% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (8.3% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (8.3% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (16.7% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 8.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.3% coverage).
Annotation unit is under-specified (16.7% coverage).

Suggested Next Analyses

Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

Metric Slice: helpfulness Recent High-Signal Papers

Known Limitations

Only 8.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (6)

Top Metrics

Helpfulness (1)

Top Benchmarks

Quality Controls

Calibration (1)

Papers In This Archive Slice

InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang · Mar 9, 2025 · Citations: 0

Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-11% improvements across MATH500, AIME24, and GPQA_diamond benchmarks.
VQEL: Enabling Self-Play in Emergent Language Games via Agent-Internal Vector Quantization
Mohammad Mahdi Samiei Paqaleh, Mehdi Jamalkhah, Mahdieh Soleymani Baghshah · Mar 6, 2025 · Citations: 0

Emergent Language (EL) focuses on the emergence of communication among artificial agents.
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng · Mar 6, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen · Mar 3, 2025 · Citations: 0

A response mixed of factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on.
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Hanjiang Hu, Alexander Robey, Changliu Liu · Feb 28, 2025 · Citations: 0

Red Team

To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues.
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz · Feb 26, 2025 · Citations: 0

To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks.
Compressing Language Models for Specialized Domains
Miles Williams, George Chrysostomou, Vitor Jeronymo, Nikolaos Aletras · Feb 25, 2025 · Citations: 0

Compression techniques such as pruning and quantization offer a practical path towards efficient LM deployment, exemplified by their ability to preserve performance on general-purpose benchmarks.
PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han · Feb 25, 2025 · Citations: 0

To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.
Can Multimodal LLMs Perform Time Series Anomaly Detection?
Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao · Feb 25, 2025 · Citations: 0

Multi Agent

One natural way for humans to detect time series anomalies is through visualization and textual description.
Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 24, 2025 · Citations: 0

Natural Language Processing (NLP) is becoming a dominant subset of artificial intelligence as the need to help machines understand human language looks indispensable.
HIPPO: Enhancing the Table Understanding Capability of LLMs through Hybrid-Modal Preference Optimization
Haolan Wang, Zhenghao Liu, Xinze Li, Xiaocui Yang, Yu Gu · Feb 24, 2025 · Citations: 0

Pairwise Preference

To better capture structural semantics from the tabular data, this paper introduces the HybrId-modal Preference oPtimizatiOn (HIPPO) model, which represents tables using both text and image, optimizing MLLMs by learning more comprehensive…
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen · Feb 24, 2025 · Citations: 0

Pairwise Preference

Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote