HFEPX Archive Slice

HFEPX Daily Archive: 2025-05-28

Updated from current HFEPX corpus (Apr 9, 2026). 8 papers are grouped in this daily page.

Most common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Most frequently cited benchmark: RTC-Bench. Most common metric: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from May 28, 2025.

Papers: 8 · Last published: May 28, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality during this period. Quality band: Medium.

High-Signal Coverage

100.0%

8 of 8 papers are not flagged as low-signal.

Benchmark Anchors

25.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

50.0%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Time Slice Matters

37.5% of papers (3 of 8) report explicit human-feedback signals, led by expert verification.
Automatic metrics appear in 50% of papers in this hub.
RTC-Bench is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Raters are mostly domain experts, and annotation commonly uses mixed annotation units; use this to scope replication staffing.
Stratify by benchmark (RTC-Bench vs SYCON-Bench) before comparing methods.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
May 28, 2025 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Jailbreak success rate

Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation
May 28, 2025 · Citations: 0 · Score: 4.5
Eval: Automatic Metrics · Metrics: Cost

Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
May 28, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Accuracy, Perplexity

Flying Pigs, FaR and Beyond: Evaluating LLM Reasoning in Counterfactual Worlds
May 28, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Accuracy

Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation
May 28, 2025 · Citations: 0 · Score: 2.5
Eval: Not reported · Metrics: Not reported

Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
May 28, 2025 · Citations: 0 · Score: 2.0
Eval: Not reported · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Published	Eval Modes	Benchmarks	Metrics	Quality Controls
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments	May 28, 2025	Automatic Metrics	RTC-Bench	Jailbreak success rate	Not reported
Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation	May 28, 2025	Automatic Metrics	Not reported	Cost	Not reported
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation	May 28, 2025	Automatic Metrics	Not reported	Accuracy, Perplexity	Not reported
Flying Pigs, FaR and Beyond: Evaluating LLM Reasoning in Counterfactual Worlds	May 28, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation	May 28, 2025	Not reported	Mixture Of Retrieval	Not reported	Not reported
Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages	May 28, 2025	Not reported	Not reported	Not reported	Not reported
StressTest: Can YOUR Speech LM Handle the Stress?	May 28, 2025	Not reported	Not reported	Not reported	Not reported
Measuring Sycophancy of Language Models in Multi-turn Dialogues	May 28, 2025	Not reported	Not reported	Not reported	Not reported
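
One way to act on the advice to prioritize papers with both benchmark and metric anchors is to filter the matrix rows directly. The sketch below is illustrative only: the tuples are hand-copied from the table above with shortened titles, and `MATRIX`, `anchored`, and `coverage` are hypothetical helper names, not part of any HFEPX tooling.

```python
NOT_REPORTED = "Not reported"

# (short title, benchmarks, metrics), hand-copied from the Protocol Matrix.
# Titles are abbreviated for readability; this is an assumption of the sketch.
MATRIX = [
    ("RedTeamCUA", "RTC-Bench", "Jailbreak success rate"),
    ("Reviewing Scientific Papers", NOT_REPORTED, "Cost"),
    ("Bayesian Attention Mechanism", NOT_REPORTED, "Accuracy, Perplexity"),
    ("Flying Pigs, FaR and Beyond", NOT_REPORTED, "Accuracy"),
    ("Mixture-of-Retrieval Experts", "Mixture Of Retrieval", NOT_REPORTED),
    ("Counting trees", NOT_REPORTED, NOT_REPORTED),
    ("StressTest", NOT_REPORTED, NOT_REPORTED),
    ("Measuring Sycophancy", NOT_REPORTED, NOT_REPORTED),
]


def anchored(rows):
    """Return titles of rows that report both a benchmark and a metric."""
    return [
        title
        for title, benchmarks, metrics in rows
        if benchmarks != NOT_REPORTED and metrics != NOT_REPORTED
    ]


def coverage(rows, column):
    """Percentage of rows whose given column is reported (1 = benchmarks, 2 = metrics)."""
    hits = sum(1 for row in rows if row[column] != NOT_REPORTED)
    return 100.0 * hits / len(rows)


if __name__ == "__main__":
    print(anchored(MATRIX))
    print(coverage(MATRIX, 1), coverage(MATRIX, 2))
```

On these rows only one paper carries both anchors, and the two coverage figures come out to 25.0 and 50.0, matching the Benchmark Anchors and Metric Anchors percentages reported on this page.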

Researcher Workflow (Detailed)

Checklist

Moderate: Papers with explicit human feedback
Coverage is usable but incomplete (37.5% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Moderate: Papers naming benchmarks/datasets
Coverage is usable but incomplete (25% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (37.5% vs 35% target).

Moderate: Papers with known rater population
Coverage is usable but incomplete (25% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (0% vs 35% target).
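
The page does not state how coverage is mapped to the Strong / Moderate / Gap labels. The sketch below shows one rule that happens to reproduce all six labels in the checklist: "Strong" at or above target, "Gap" below half the target, "Moderate" otherwise. The half-target cutoff and the function name `status` are assumptions made for illustration, not the site's documented logic.

```python
def status(coverage_pct, target_pct):
    """Classify coverage against a target.

    Assumed rule (not documented by the source):
    at or above target -> Strong; below half the target -> Gap; else Moderate.
    """
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= target_pct / 2:
        return "Moderate"
    return "Gap"


# (checklist item, coverage %, target %) as listed in the checklist.
CHECKLIST = [
    ("Papers with explicit human feedback", 37.5, 45),
    ("Papers reporting quality controls", 0, 30),
    ("Papers naming benchmarks/datasets", 25, 35),
    ("Papers naming evaluation metrics", 37.5, 35),
    ("Papers with known rater population", 25, 35),
    ("Papers with known annotation unit", 0, 35),
]

if __name__ == "__main__":
    for item, cov, target in CHECKLIST:
        print(f"{status(cov, target)}: {item} ({cov}% vs {target}% target)")
```

Other cutoffs would fit the same six rows, so treat the threshold as a placeholder to tune against more archive pages.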

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

No papers report quality controls (0%); prioritize calibration/adjudication evidence.
Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

Stratify by benchmark (RTC-Bench vs SYCON-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
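
The two suggestions above can be combined in a small grouping step: split results by benchmark first, then report accuracy and cost side by side within each stratum. The following is a minimal sketch under stated assumptions. The record schema (`benchmark`, `method`, `accuracy`, `cost`), the helper names, and every number in `EXAMPLE_RECORDS` are hypothetical placeholders, not results from any paper in this slice.

```python
from collections import defaultdict
from statistics import mean


def stratify(records):
    """Group result records by their benchmark name."""
    groups = defaultdict(list)
    for record in records:
        groups[record["benchmark"]].append(record)
    return dict(groups)


def summarize(records):
    """Per-benchmark summary that reports accuracy and cost together."""
    summary = {}
    for benchmark, rows in stratify(records).items():
        summary[benchmark] = {
            "n": len(rows),
            "mean_accuracy": mean(r["accuracy"] for r in rows),
            "mean_cost": mean(r["cost"] for r in rows),
        }
    return summary


# Placeholder values for illustration only; not taken from any paper.
EXAMPLE_RECORDS = [
    {"benchmark": "RTC-Bench", "method": "method_a", "accuracy": 0.5, "cost": 1.0},
    {"benchmark": "RTC-Bench", "method": "method_b", "accuracy": 0.75, "cost": 3.0},
    {"benchmark": "SYCON-Bench", "method": "method_a", "accuracy": 0.25, "cost": 2.0},
]

if __name__ == "__main__":
    for benchmark, stats in summarize(EXAMPLE_RECORDS).items():
        print(benchmark, stats)
```

Keeping the strata separate avoids averaging scores across benchmarks with different difficulty, and pairing accuracy with cost makes it visible when a gain in one is bought with the other.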

Recommended Queries

Benchmark Slice: RTC-Bench · Metric Slice: accuracy · Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (4)

Top Metrics

Accuracy (1)
Cost (1)
Jailbreak success rate (1)

Top Benchmarks

RTC-Bench (1)
SYCON-Bench (1)

Quality Controls

None reported (0 of 8 papers).

Papers In This Archive Slice

Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025 · Citations: 0
However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.

Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
Kaja Dobrovoljc · May 28, 2025 · Citations: 0
Pairwise Preference
Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.

StressTest: Can YOUR Speech LM Handle the Stress?
Iddo Yosha, Gallil Maimon, Yossi Adi · May 28, 2025 · Citations: 0
Despite the crucial role of sentence stress in shaping meaning and intent, it remains largely overlooked in evaluation and development of SLMs.

Measuring Sycophancy of Language Models in Multi-turn Dialogues
Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, Jinho D. Choi · May 28, 2025 · Citations: 0

Flying Pigs, FaR and Beyond: Evaluating LLM Reasoning in Counterfactual Worlds
Anish R Joishy, Ishwar B Balappanawar, Vamshi Krishna Bonagiri, Manas Gaur, Krishnaprasad Thirunarayan · May 28, 2025 · Citations: 0
Evaluation of 11 LLMs across six diverse reasoning datasets reveals a consistent failure: model accuracy plummets by an average of 14% in counterfactual scenarios compared to knowledge-aligned ones.

Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation
Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan · May 28, 2025 · Citations: 0
Experimental results on diverse open-domain QA benchmarks demonstrate the effectiveness of MoRE, achieving average performance gains of over 7% compared to competitive baselines.

Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation
Tianmai M. Zhang, Neil F. Abernethy · May 28, 2025 · Citations: 0
Expert Verification
However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews and instigating intentional manipulation.

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025 · Citations: 0
Red Team Web Browsing
Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities.
