HFEPX Archive Slice

HFEPX Daily Archive: 2026-03-03

Updated from current HFEPX corpus (Mar 8, 2026). 44 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Kernelbench. Common metric signal: success rate. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 3, 2026.

Papers: 44 · Last published: Mar 3, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality during this period. Quality band: High.

High-Signal Coverage

100.0%

44 of 44 papers are not flagged as low-signal.

Benchmark Anchors

13.6%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

38.6%

Papers with reported metric mentions in extraction output.

1 paper reports explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Why This Time Slice Matters

13.6% of papers report explicit human-feedback signals, led by expert verification.
Automatic metrics appear in 34.1% of papers in this hub.
Kernelbench is a recurring benchmark anchor for cross-paper comparisons in this page.
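The coverage percentages on this page appear to be simple shares of the 44 papers in the slice. A minimal sketch of that arithmetic follows, using counts listed elsewhere on this page (15 papers with Automatic Metrics, 1 paper with calibration, 44 not low-signal flagged). The function name `coverage_pct` is hypothetical, and rounding to one decimal is an assumption about how the page formats its figures.

```python
def coverage_pct(count: int, total: int) -> float:
    """Share of papers carrying a signal, as a percentage rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


TOTAL_PAPERS = 44  # papers grouped in this daily page

# Counts below are taken from this page; the rounding rule is an assumption.
print(coverage_pct(15, TOTAL_PAPERS))  # Automatic Metrics share
print(coverage_pct(1, TOTAL_PAPERS))   # papers reporting calibration
print(coverage_pct(44, TOTAL_PAPERS))  # papers not low-signal flagged
```

Recomputing shares this way is a quick check that a quoted percentage matches the count it is supposed to summarize.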

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (2.3% of papers).
Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
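One simple way to read "agreement drift" is the change in the raw agreement rate between human labels and LLM-judge labels from one period to the next. The sketch below assumes that reading; the labels, the two periods, and the helper name `agreement_rate` are all hypothetical and are not taken from any paper in this slice.

```python
def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of items where the human label equals the LLM-judge label."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


# Hypothetical labels for the same four items in two periods.
earlier = agreement_rate(["A", "B", "A", "A"], ["A", "B", "B", "A"])
later = agreement_rate(["A", "B", "A", "A"], ["B", "B", "B", "A"])

# Negative drift means the judge agrees with humans less often in the later period.
drift = later - earlier
print(earlier, later, drift)
```

Raw agreement does not correct for chance, so a chance-corrected statistic is usually reported alongside it.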

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
Mar 3, 2026 · Citations: 0 · Score: 7.5
Eval: Automatic Metrics · Metrics: Success rate

Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
Mar 3, 2026 · Citations: 0 · Score: 7.0
Eval: Automatic Metrics · Metrics: Brier score, Auroc

Think, But Don't Overthink: Reproducing Recursive Language Models
Mar 3, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy

PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems
Mar 3, 2026 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Rouge

MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Mar 3, 2026 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Success rate, Jailbreak success rate

Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility
Mar 3, 2026 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics, Simulation Env · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning Mar 3, 2026	Automatic Metrics	Kernelbench	Success rate	Not reported
Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification Mar 3, 2026	Automatic Metrics	Not reported	Brier score, Auroc	Calibration
Think, But Don't Overthink: Reproducing Recursive Language Models Mar 3, 2026	Automatic Metrics	Needle In A Haystack	Accuracy	Not reported
PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems Mar 3, 2026	Automatic Metrics	Not reported	Rouge	Not reported
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models Mar 3, 2026	Automatic Metrics	Not reported	Success rate, Jailbreak success rate	Not reported
Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility Mar 3, 2026	Automatic Metrics, Simulation Env	Not reported	Accuracy	Not reported
Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems Mar 3, 2026	Automatic Metrics	Not reported	F1, Success rate	Not reported
MaBERT:A Padding Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling Mar 3, 2026	Automatic Metrics	Not reported	Latency	Not reported
Contextualized Privacy Defense for LLM Agents Mar 3, 2026	Simulation Env	Not reported	Helpfulness	Not reported
ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation Mar 3, 2026	Automatic Metrics	Not reported	Cost	Not reported
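The triage advice on this page is to prioritize papers that have both a benchmark anchor and a metric anchor. A minimal sketch of that filter over the matrix is shown here, assuming the table is available as tab-separated text. Only four rows are included, paper titles are shortened for readability, and the helper name `has_both_anchors` is hypothetical.

```python
import csv
import io

NOT_REPORTED = "Not reported"

# Four rows from the protocol matrix, tab-separated; titles shortened for this sketch.
MATRIX_TSV = (
    "Paper\tEval Modes\tBenchmarks\tMetrics\tQuality Controls\n"
    "StitchCUDA\tAutomatic Metrics\tKernelbench\tSuccess rate\tNot reported\n"
    "Guideline-Grounded Evidence Accumulation\tAutomatic Metrics\tNot reported\tBrier score, Auroc\tCalibration\n"
    "Think, But Don't Overthink\tAutomatic Metrics\tNeedle In A Haystack\tAccuracy\tNot reported\n"
    "PrivMedChat\tAutomatic Metrics\tNot reported\tRouge\tNot reported\n"
)


def has_both_anchors(row: dict) -> bool:
    """True when a row names both a benchmark and a metric."""
    return row["Benchmarks"] != NOT_REPORTED and row["Metrics"] != NOT_REPORTED


rows = list(csv.DictReader(io.StringIO(MATRIX_TSV), delimiter="\t"))
anchored = [row["Paper"] for row in rows if has_both_anchors(row)]
print(anchored)
```

On the ten rows of the matrix, only the StitchCUDA and Think, But Don't Overthink entries name a benchmark, so they are the natural starting points for longitudinal comparison.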

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (13.6% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (2.3% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (2.3% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (13.6% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (11.4% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (15.9% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 2.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (11.4% coverage).
Annotation unit is under-specified (15.9% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Track metric sensitivity by reporting both success rate and accuracy.
Add inter-annotator agreement checks when reproducing these protocols.
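An inter-annotator agreement check can be as small as Cohen's kappa over two raters' labels for the same items. The sketch below is a plain-Python version of the standard two-rater formula; the labels are hypothetical, and nothing here is taken from the protocols of the papers in this slice. The same function could compare a human rater against an LLM judge.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters labeling the same items."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary labels from two annotators on four items.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(kappa)
```

Kappa near 1 indicates strong agreement, while values near 0 indicate agreement no better than chance. Ranking annotations, the most common unit in this slice, would usually need a rank-based statistic instead.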

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: Kernelbench
Metric Slice: success rate
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (15)
Simulation Env (4)
Human Eval (1)
Llm As Judge (1)

Top Metrics

Success rate (2)
Accuracy (1)
Auroc (1)
Brier score (1)
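For readers less familiar with two of the metrics above, here is a minimal sketch of the Brier score (mean squared error of predicted probabilities) and AUROC (the probability that a positive item is scored above a negative one, with ties counted as half). The probabilities and outcomes are hypothetical, and the quadratic-time AUROC is meant for illustration rather than large datasets.

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def auroc(scores: list[float], labels: list[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical verifier confidences and ground-truth outcomes.
probs = [0.9, 0.2, 0.6, 0.4]
outcomes = [1, 0, 1, 0]
print(brier_score(probs, outcomes), auroc(probs, outcomes))
```

Lower Brier scores indicate better-calibrated probabilities, while AUROC measures ranking quality only, so the two can disagree.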

Top Benchmarks

Kernelbench (1)

Quality Controls

Calibration (1)

Papers In This Archive Slice

Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility
Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas · Mar 3, 2026 · Citations: 0

As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor.
ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard, Besnik Fetahu, Nasser Zalmout · Mar 3, 2026 · Citations: 0
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0
Tucano 2 Cool: Better Open Source LLMs for Portuguese
Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf · Mar 3, 2026 · Citations: 0
Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?
Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan · Mar 3, 2026 · Citations: 0
Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems
Raad Khraishi, Iman Zafar, Katie Myles, Greig A Cowan · Mar 3, 2026 · Citations: 0

We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence…
Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features
Kyle Janse van Rensburg, Benjamin van Niekerk, Herman Kamper · Mar 3, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection
Sofiane Elguendouze, Erwan Hain, Elena Cabrio, Serena Villata · Mar 3, 2026 · Citations: 0

Experiments on standard benchmarks show that our approach achieves higher performance compared to state-of-the-art systems.
TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models
Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu · Mar 3, 2026 · Citations: 0

Red Team

Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses.
TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
Christian Greisinger, Steffen Eger · Mar 3, 2026 · Citations: 0

Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at…
Incremental Graph Construction Enables Robust Spectral Clustering of Texts
Marko Pranjić, Boshko Koloski, Nada Lavrač, Senja Pollak, Marko Robnik-Šikonja · Mar 3, 2026 · Citations: 0

We validate the approach on spectral clustering of SentenceTransformer embeddings using Laplacian eigenmaps across six clustering datasets from the Massive Text Embedding Benchmark.
PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems
Sudip Bhujel · Mar 3, 2026 · Citations: 0

Pairwise Preference Expert Verification

Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content.
TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health
Zixin Xiong, Ziteng Wang, Haotian Fan, Xinjie Zhang, Wenxuan Wang · Mar 3, 2026 · Citations: 0

While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domain's high-stakes and safety-sensitive…
MaBERT:A Padding Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling
Jinwoong Kim, Sangjin Park · Mar 3, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Contextualized Privacy Defense for LLM Agents
Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie · Mar 3, 2026 · Citations: 0

Long Horizon

LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability.
ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation
Bo Xu, Haotian Wu, Hehai Lin, Weiquan Huang, Beier Zhu · Mar 3, 2026 · Citations: 0

Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods.
Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction
Guangjun Zhang, Hu Zhang, Yazhou Han, Yue Fan, Yuhang Shao · Mar 3, 2026 · Citations: 0

Multi Agent

Moreover, ensuring the reliability and usability of synthetic data remains a significant challenge due to the absence of quality evaluation mechanisms.
Eval4Sim: An Evaluation Framework for Persona Simulation
Eliseo Bao, Anxo Perez, Xi Wang, Javier Parapar · Mar 3, 2026 · Citations: 0

Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural…
LaTeX Compilation: Challenges in the Era of LLMs
Tianyou Liu, Ziqiang Li, Xurui Liu, Yansong Li · Mar 3, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models
Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito · Mar 3, 2026 · Citations: 0

Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges…
The Distribution of Phoneme Frequencies across the World's Languages: Macroscopic and Microscopic Information-Theoretic Models
Fermín Moscoso del Prado Martín, Suchir Salhan · Mar 3, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Browser-based Open Source Assistant for Multimodal Content Verification
Rosanna Milner, Michael Foster, Olesya Razuvayevskaya, Ian Roberts, Valentin Porcellini · Mar 3, 2026 · Citations: 0

Web Browsing

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs
Prarthana Bhattacharyya, Joshua Mitton, Ralph Abboud, Simon Woodhead · Mar 3, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu · Mar 3, 2026 · Citations: 0

Expert Verification Long Horizon

As LLM-powered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment.
OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets
Jiyuan Shen, Peiyue Yuan, Atin Ghosh, Yifan Mai, Daniel Dahlmeier · Mar 3, 2026 · Citations: 0

In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction.
From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench
Weikang Shi, Houxing Ren, Junting Pan, Aojun Zhou, Ke Wang · Mar 3, 2026 · Citations: 0

Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching…
Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration
Linhao Zhong, Linyu Wu, Wen Wang, Yuling Xi, Chenchen Jing · Mar 3, 2026 · Citations: 0

However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation.
Sensory-Aware Sequential Recommendation via Review-Distilled Representations
Yeo Chan Yoon · Mar 3, 2026 · Citations: 0

Qualitative analysis further shows that the extracted attributes align closely with human perceptions of products, enabling interpretable connections between natural language descriptions and recommendation behavior.
Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization
Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu · Mar 3, 2026 · Citations: 0

Multi Agent

Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS).
HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse
Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya, Md. Shad Akhtar · Mar 3, 2026 · Citations: 0

Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives.
ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs
Wicaksono Leksono Muhamad, Joanito Agili Lopo, Tack Hwa Wong, Muhammad Ravi Shulthan Habibi, Samuel Cahyawijaya · Mar 3, 2026 · Citations: 0

Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or…
Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory
Shunki Uebayashi, Kento Masui, Kyohei Atarashi, Han Bao, Hisashi Kashima · Mar 3, 2026 · Citations: 0

Benchmarks for MLLMs should measure their ability for cross-modal integration.
Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches
Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi · Mar 3, 2026 · Citations: 0

Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone.
Credibility Governance: A Social Mechanism for Collective Self-Correction under Weak Truth Signals
Wanying He, Yanxi Lin, Ziheng Zhou, Xue Feng, Min Peng · Mar 3, 2026 · Citations: 0

We propose Credibility Governance (CG), a mechanism that reallocates influence by learning which agents and viewpoints consistently track evolving public evidence.
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong · Mar 3, 2026 · Citations: 0

Rubric Rating Multi Agent

To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it…
Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models
Shubhangi Upasani, Ravi Shanker Raju, Bo Li, Mengmeing Ji, John Long · Mar 3, 2026 · Citations: 0

Prompt length is a major bottleneck in agentic large language model (LLM) workloads, where repeated inference steps and multi-call loops incur substantial prefill cost.
Think, But Don't Overthink: Reproducing Recursive Language Models
Daren Wang · Mar 3, 2026 · Citations: 0

Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated pure LLM, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks.
GPUTOK: GPU Accelerated Byte Level BPE Tokenization
Venu Gopal Kadamba, Kanishkha Jaisankar · Mar 3, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ExpGuard: LLM Content Moderation in Specialized Domains
Minseok Choi, Dongjin Kim, Seungbin Yang, Subin Kim, Youngjun Kwak · Mar 3, 2026 · Citations: 0

Expert Verification

With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies.
How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities
Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang · Mar 3, 2026 · Citations: 0

We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality.
FlashEvaluator: Expanding Search Space with Parallel Evaluation
Chao Feng, Yuanhao Pu, Chenghao Zhang, Shanqi Liu, Shuchang Liu · Mar 3, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan · Mar 3, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think
Junzhe Shen, Jieru Zhao, Ziwei He, Zhouhan Lin · Mar 3, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026 · Citations: 0

Red Team Web Browsing

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
