HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W22

Updated from current HFEPX corpus (Mar 8, 2026). 18 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 8, 2026). 18 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics. Frequently cited benchmark: Rtc-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Jun 1, 2025.

Papers: 18 Last published: Jun 1, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

18 / 18 papers are not low-signal flagged.

Benchmark Anchors

11.1%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

16.7%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Time Slice Matters

16.7% of papers report explicit human-feedback signals, led by critique/edit feedback.
automatic metrics appears in 16.7% of papers in this hub.
Rtc-Bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Stratify by benchmark (Rtc-Bench vs SYCON-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
May 28, 2025 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Jailbreak success rate
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Jun 1, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Accuracy
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
May 28, 2025 · Citations: 0 · Score: 3.5

Eval: Automatic Metrics · Metrics: Accuracy, Perplexity
Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
May 28, 2025 · Citations: 0 · Score: 2.0

Eval: Not reported · Metrics: Not reported
REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
May 26, 2025 · Citations: 0 · Score: 2.0

Eval: Not reported · Metrics: Not reported
When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
May 30, 2025 · Citations: 0 · Score: 1.0

Eval: Not reported · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments May 28, 2025	Automatic Metrics	Rtc Bench	Jailbreak success rate	Not reported
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models Jun 1, 2025	Automatic Metrics	Needle In A Haystack	Accuracy	Not reported
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation May 28, 2025	Automatic Metrics	Not reported	Accuracy, Perplexity	Not reported
Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages May 28, 2025	Not reported	Not reported	Not reported	Not reported
REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning May 26, 2025	Not reported	Not reported	Not reported	Not reported
When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations May 30, 2025	Not reported	Not reported	Not reported	Not reported
PonderLM: Pretraining Language Models to Ponder in Continuous Space May 27, 2025	Not reported	Not reported	Not reported	Not reported
FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information May 27, 2025	Not reported	Not reported	Not reported	Not reported
Types of Relations: Defining Analogies with Category Theory May 26, 2025	Not reported	Not reported	Not reported	Not reported
DeepQuestion: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance May 30, 2025	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (16.7% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (11.1% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (44.4% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (0% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (0% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

Stratify by benchmark (Rtc-Bench vs SYCON-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Benchmark Slice: Rtc-Bench Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (3)

Top Metrics

Accuracy (6)
Cost (2)
Faithfulness (1)
Jailbreak success rate (1)

Top Benchmarks

Rtc Bench (1)
SYCON Bench (1)

Quality Controls

Papers In This Archive Slice

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng · Jun 1, 2025 · Citations: 0

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results.
DeepQuestion: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance
Ali Khoramfar, Ali Ramezani, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi · May 30, 2025 · Citations: 0
When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang · May 30, 2025 · Citations: 0

To address this, we first propose a pipeline to construct MMEVOKE, a benchmark for evaluating LMMs' ability in multimodal evolving knowledge injection.
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025 · Citations: 0

However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
Kaja Dobrovoljc · May 28, 2025 · Citations: 0

Pairwise Preference

Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.
Measuring Sycophancy of Language Models in Multi-turn Dialogues
Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, Jinho D. Choi · May 28, 2025 · Citations: 0
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025 · Citations: 0

Red Team Web Browsing

Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities.
VeriTrail: Closed-Domain Hallucination Detection with Traceability
Dasha Metropolitansky, Jonathan Larson · May 27, 2025 · Citations: 0
R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning
Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang · May 27, 2025 · Citations: 0
RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning
Xiao Liu, Da Yin, Zirui Wu, Yansong Feng · May 27, 2025 · Citations: 0
Augmenting Research Ideation with Data: An Empirical Investigation in Social Science
Xiao Liu, Xinyi Dong, Xinyang Gao, Yansong Feng, Xun Pang · May 27, 2025 · Citations: 0
RPM: Reasoning-Level Personalization for Black-Box Large Language Models
Jieyong Kim, Tongyoung Kim, Soojin Yoon, Jaehyung Kim, Dongha Lee · May 27, 2025 · Citations: 0
Tracing and Reversing Edits in LLMs
Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer · May 27, 2025 · Citations: 0
PonderLM: Pretraining Language Models to Ponder in Continuous Space
Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li · May 27, 2025 · Citations: 0

Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort.
FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information
Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang · May 27, 2025 · Citations: 0

Existing benchmarks oversimplify this task as flat, single step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents.
Characterizing Pattern Matching and Its Limits on Compositional Task Structures
Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko · May 26, 2025 · Citations: 0
REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang · May 26, 2025 · Citations: 0

Critique Edit

To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision.
Types of Relations: Defining Analogies with Category Theory
Claire Ott, Frank Jäkel · May 26, 2025 · Citations: 0

In order to behave intelligently both humans and machines have to represent their knowledge adequately for how it is used.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote