HFEPX Archive Slice

HFEPX Weekly Archive: 2026-W10

Updated from current HFEPX corpus (Mar 8, 2026). 256 papers are grouped in this weekly page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: AIME. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 5, 2026.

Papers: 256 · Last published: Mar 5, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 256 papers).

High-Signal Coverage: 100.0%
60 / 60 papers are not low-signal flagged.

Benchmark Anchors: 10.0%
Papers with benchmark/dataset mentions in extraction output.

Metric Anchors: 20.0%
Papers with reported metric mentions in extraction output.

1 paper reports explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Why This Time Slice Matters

8.6% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 18% of papers in this hub.
AIME is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

The most common quality-control signal is rater calibration (0.8% of papers).
Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Mar 5, 2026 · Citations: 0 · Score: 6.5
Eval: LLM as Judge, Automatic Metrics · Metrics: F1, F1 weighted

AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection
Mar 5, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: F1, F1 macro

Free Lunch for Pass@k? Low Cost Diverse Sampling for Diffusion Language Models
Mar 5, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Cost

VRM: Teaching Reward Models to Understand Authentic Human Preferences
Mar 5, 2026 · Citations: 0 · Score: 6.0
Eval: Human Eval · Metrics: Coherence

When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger
Mar 5, 2026 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Cost

LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services
Mar 5, 2026 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Latency, Relevance

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts (Mar 5, 2026)	LLM as Judge, Automatic Metrics	ThaiSafetyBench	F1, F1 weighted	Not reported
AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection (Mar 5, 2026)	Automatic Metrics	SemEval	F1, F1 macro	Not reported
Free Lunch for Pass@k? Low Cost Diverse Sampling for Diffusion Language Models (Mar 5, 2026)	Automatic Metrics	GSM8K, HumanEval+	Cost	Not reported
VRM: Teaching Reward Models to Understand Authentic Human Preferences (Mar 5, 2026)	Human Eval	Not reported	Coherence	Not reported
When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger (Mar 5, 2026)	Automatic Metrics	Not reported	Cost	Not reported
LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services (Mar 5, 2026)	Automatic Metrics	Not reported	Latency, Relevance	Not reported
Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research (Mar 5, 2026)	Automatic Metrics	Not reported	F1, Agreement	Adjudication
Functionality-Oriented LLM Merging on the Fisher-Rao Manifold (Mar 5, 2026)	Automatic Metrics	Not reported	Accuracy	Not reported
Replaying pre-training data improves fine-tuning (Mar 5, 2026)	Automatic Metrics	Not reported	Accuracy	Not reported
POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation (Mar 5, 2026)	Not reported	Not reported	Throughput, Cost	Not reported
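
The earlier advice to prioritize papers with both benchmark and metric anchors can be applied to this matrix mechanically. The following is a minimal Python sketch, assuming the rows are transcribed by hand with shortened titles; the names `MATRIX`, `MISSING`, and `has_both_anchors` are hypothetical and not part of any HFEPX tooling.

```python
# Rows transcribed from the protocol matrix: (short title, benchmarks, metrics).
# Titles are abbreviated here for readability (an editorial shortcut).
MATRIX = [
    ("ThaiSafetyBench", "ThaiSafetyBench", "F1, F1 weighted"),
    ("AILS-NTUA SemEval-2026 Task 10", "SemEval", "F1, F1 macro"),
    ("Free Lunch for Pass@k?", "GSM8K, HumanEval+", "Cost"),
    ("VRM", "Not reported", "Coherence"),
    ("When Weak LLMs Speak with Confidence", "Not reported", "Cost"),
    ("LocalSUG", "Not reported", "Latency, Relevance"),
    ("Can LLMs Capture Expert Uncertainty?", "Not reported", "F1, Agreement"),
    ("Functionality-Oriented LLM Merging", "Not reported", "Accuracy"),
    ("Replaying pre-training data", "Not reported", "Accuracy"),
    ("POET-X", "Not reported", "Throughput, Cost"),
]

# Placeholder the matrix uses for fields that were not extracted.
MISSING = "Not reported"


def has_both_anchors(row):
    """True when a row names at least one benchmark and at least one metric."""
    _, benchmarks, metrics = row
    return benchmarks != MISSING and metrics != MISSING


anchored = [row[0] for row in MATRIX if has_both_anchors(row)]
print(anchored)
```

Under this reading, only the first three rows of the matrix carry both anchors, so they are the natural starting point for longitudinal comparison.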

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (8.6% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (1.6% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (8.2% vs 35% target).

Moderate: Papers naming evaluation metrics
Coverage is usable but incomplete (28.5% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (7% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (9.8% vs 35% target).
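
The checklist compares observed coverage against a target for each dimension. The page does not say how the "Gap" and "Moderate" labels are assigned, so the sketch below uses an assumed cutoff (coverage of at least half the target counts as "Moderate") that happens to reproduce the labels shown. The `status` helper, the `MODERATE_RATIO` constant, and the "Met" label are all hypothetical.

```python
# (dimension, observed coverage %, target %) as listed in the checklist.
CHECKLIST = [
    ("explicit human feedback", 8.6, 45.0),
    ("quality controls", 1.6, 30.0),
    ("benchmarks/datasets", 8.2, 35.0),
    ("evaluation metrics", 28.5, 35.0),
    ("known rater population", 7.0, 35.0),
    ("known annotation unit", 9.8, 35.0),
]

# ASSUMPTION: the real labeling rule is not documented on the page.
# A cutoff at half the target reproduces the labels in the checklist.
MODERATE_RATIO = 0.5


def status(observed, target, moderate_ratio=MODERATE_RATIO):
    """Classify coverage relative to its target."""
    if observed >= target:
        return "Met"  # hypothetical label; no checklist row reaches its target
    if observed / target >= moderate_ratio:
        return "Moderate"
    return "Gap"


report = [
    (name, status(obs, tgt), round(tgt - obs, 1))
    for name, obs, tgt in CHECKLIST
]
for name, label, shortfall in report:
    print(f"{label:8} {name}: {shortfall} points below target")
```

Sorting `report` by shortfall is one simple way to decide which coverage gap to address first.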

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 1.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7% coverage).
Annotation unit is under-specified (9.8% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (AIME vs MMLU) before comparing methods.
Track metric sensitivity by reporting both accuracy and latency.
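
One way to make "judge-human agreement drift" concrete is to compute a chance-corrected agreement statistic per period and compare periods. The sketch below uses Cohen's kappa, which is one reasonable choice and not something the page prescribes. The labels are hypothetical toy data, and treating drift as the difference in kappa between two weeks is an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# HYPOTHETICAL toy labels: human raters vs an LLM judge in two periods.
human_w09 = ["pass", "pass", "fail", "fail"]
judge_w09 = ["pass", "pass", "fail", "fail"]
human_w10 = ["pass", "pass", "fail", "fail"]
judge_w10 = ["pass", "fail", "fail", "fail"]

kappa_w09 = cohens_kappa(human_w09, judge_w09)
kappa_w10 = cohens_kappa(human_w10, judge_w10)

# ASSUMPTION: "drift" is read here as the change in kappa between periods.
drift = kappa_w10 - kappa_w09
print(kappa_w09, kappa_w10, drift)
```

A negative `drift` would indicate that the judge agrees less with human raters in the later period. With real data, the same items or a comparable sample should be used in both periods so the comparison is meaningful.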

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: AIME
Metric Slice: accuracy
IAA-Reported Evaluations
Recent High-Signal Papers

Known Limitations

Only 1.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (46)
Simulation Env (7)
LLM as Judge (5)
Human Eval (3)

Top Metrics

Accuracy (25)
Latency (10)
Cost (9)
F1 (9)

Top Benchmarks

AIME (2)
MMLU (2)
SWE-bench (2)
BBH (1)

Quality Controls

Calibration (2)
Adjudication (1)
Inter-Annotator Agreement Reported (1)
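
The counts in this snapshot can be converted to shares to cross-check the percentages quoted earlier on the page. This is a small sketch, assuming the counts are taken over all 256 papers in the slice and not over the 60-paper loaded sample; the `share` helper is hypothetical.

```python
TOTAL_PAPERS = 256  # "Papers: 256" in the page header

# ASSUMPTION: counts are over the full slice, not the 60-paper sample.
COUNTS = {
    "Automatic Metrics": 46,
    "Simulation Env": 7,
    "LLM as Judge": 5,
    "Human Eval": 3,
    "Calibration": 2,
}


def share(count, total=TOTAL_PAPERS):
    """Percentage of the slice, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


shares = {label: share(n) for label, n in COUNTS.items()}
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

Under that assumption, 46 of 256 papers is about 18% for automatic metrics and 2 of 256 is about 0.8% for calibration, which agrees with the figures quoted in the summary sections.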

Papers In This Archive Slice

RoboPocket: Improve Robot Policies Instantly with Your Phone
Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le · Mar 5, 2026 · Citations: 0

Demonstrations · Long Horizon

To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones.
POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation
Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu · Mar 5, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks
Shangwen Sun, Alfredo Canziani, Yann LeCun, Jiachen Zhu · Mar 5, 2026 · Citations: 0
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks · Mar 5, 2026 · Citations: 0
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow · Mar 5, 2026 · Citations: 0
Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval
Artem Vazhentsev, Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Seleznyov · Mar 5, 2026 · Citations: 0
NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance
Abrar Eyasir, Tahsin Ahmed, Muhammad Ibrahim · Mar 5, 2026 · Citations: 0
DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates
Klaywert Danillo Ferreira de Souza, David Eduardo Pereira, Cláudio E. C. Campelo, Larissa Lucena Vasconcelos · Mar 5, 2026 · Citations: 0
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar · Mar 5, 2026 · Citations: 0
Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry
Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha · Mar 5, 2026 · Citations: 0
Ensembling Language Models with Sequential Monte Carlo
Robin Shing Moon Chan, Tianyu Liu, Samuel Kiegeland, Clemente Pasti, Jacob Hoover Vigly · Mar 5, 2026 · Citations: 0
Dissociating Direct Access from Inference in AI Introspection
Harvey Lederman, Kyle Mahowald · Mar 5, 2026 · Citations: 0
An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs
Deshan Sumanathilaka, Nicholas Micallef, Julian Hough · Mar 5, 2026 · Citations: 0
Progressive Residual Warmup for Language Model Pretraining
Tianhao Chen, Xin Xu, Lu Yin, Hao Chen, Yang Wang · Mar 5, 2026 · Citations: 0
DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning
Mohammad Mahdi Moradi, Sudhir Mudur · Mar 5, 2026 · Citations: 0
Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR
Carlos Carvalho, Francisco Teixeira, Thomas Rolland, Alberto Abad · Mar 5, 2026 · Citations: 0
A Multilingual Human Annotated Corpus of Original and Easy-to-Read Texts to Support Access to Democratic Participatory Processes
Stefan Bott, Verena Riegler, Horacio Saggion, Almudena Rascón Alcaina, Nouran Khallaf · Mar 5, 2026 · Citations: 0
PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery · Mar 5, 2026 · Citations: 0
Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong · Mar 5, 2026 · Citations: 0
WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation
Luca Della Libera, Cem Subakan, Mirco Ravanelli · Mar 5, 2026 · Citations: 0
Knowledge Divergence and the Value of Debate for Scalable Oversight
Robin Young · Mar 5, 2026 · Citations: 0

RLAIF or Synthetic Feedback

AI safety via debate and reinforcement learning from AI feedback (RLAIF) are both proposed methods for scalable oversight of advanced AI systems, yet no formal framework relates them or characterizes when debate offers an advantage.
SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning
Zhu Li, Yongjian Chen, Huiyuan Lai, Xiyuan Gao, Shekhar Nayak · Mar 5, 2026 · Citations: 0
Oral to Web: Digitizing 'Zero Resource' Languages of Bangladesh
Mohammad Mamun Or Rashid · Mar 5, 2026 · Citations: 0
VietJobs: A Vietnamese Job Advertisement Dataset
Hieu Pham Dinh, Hung Nguyen Huy, Mo El-Haj · Mar 5, 2026 · Citations: 0
Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
Ofir Ben Shoham · Mar 5, 2026 · Citations: 0
Core-based Hierarchies for Efficient GraphRAG
Jakir Hossain, Ahmet Erdem Sarıyüce · Mar 5, 2026 · Citations: 0
Distilling Formal Logic into Neural Spaces: A Kernel Alignment Approach for Signal Temporal Logic
Sara Candussio, Gabriele Sarti, Gaia Saveri, Luca Bortolussi · Mar 5, 2026 · Citations: 0
Diffusion LLMs can think EoS-by-EoS
Sarah Breckner, Sebastian Schuster · Mar 5, 2026 · Citations: 0
Transducing Language Models
Vésteinn Snæbjarnarson, Samuel Kiegeland, Tianyu Liu, Reda Boumasmoud, Ryan Cotterell · Mar 5, 2026 · Citations: 0
Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions
Kun Chen, Xianglei Liao, Kaixue Fei, Yi Xing, Xinrui Li · Mar 5, 2026 · Citations: 0
Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity
Di Zhang, Xun Wu, Shaohan Huang, Yudong Wang, Hanyong Shao · Mar 5, 2026 · Citations: 0
C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
Avni Mittal, Rauno Arike · Mar 5, 2026 · Citations: 0
Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers
Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang · Mar 5, 2026 · Citations: 0
Representation Fidelity: Auditing Algorithmic Decisions About Humans Using Self-Descriptions
Theresa Elstner, Martin Potthast · Mar 5, 2026 · Citations: 0
LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting
Yewen Li, Zhiyi Lyu, Peng Jiang, Qingpeng Cai, Fei Pan · Mar 5, 2026 · Citations: 0
Measuring the Redundancy of Decoder Layers in SpeechLLMs
Adel Moumen, Guangzhi Sun, Philip C Woodland · Mar 5, 2026 · Citations: 0
ARC-TGI: Human-Validated Task Generators with Reasoning Chain Templates for ARC-AGI
Jens Lehmann, Syeda Khushbakht, Nikoo Salehfard, Nur A Zarin Nishat, Dhananjay Bhandiwad · Mar 5, 2026 · Citations: 0
Aura: Universal Multi-dimensional Exogenous Integration for Aviation Time Series
Jiafeng Lin, Mengren Zheng, Simeng Ye, Yuxuan Wang, Huan Zhang · Mar 5, 2026 · Citations: 0
MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection
Inayat Arshad, Fajar Saleem, Ijaz Hussain · Mar 5, 2026 · Citations: 0
NeuronMoE: Neuron-Guided Mixture-of-Experts for Efficient Multilingual LLM Extension
Rongzhi Li, Hitomi Yanaka · Mar 5, 2026 · Citations: 0
Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure
Yida Lu, Jianwei Fang, Xuyang Shao, Zixuan Chen, Shiyao Cui · Mar 5, 2026 · Citations: 0
HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation
Yifan Zhu, Guanting Chen, Bing Wei, Haoran Luo · Mar 5, 2026 · Citations: 0
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat · Mar 5, 2026 · Citations: 0

Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators.
VRM: Teaching Reward Models to Understand Authentic Human Preferences
Biao Liu, Ning Xu, Junming Yang, Hao Xu, Xin Geng · Mar 5, 2026 · Citations: 0

Pairwise Preference

Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on…
Functionality-Oriented LLM Merging on the Fisher-Rao Manifold
Jiayu Wang, Zuojun Ye, Wenpeng Yin · Mar 5, 2026 · Citations: 0

Across various benchmarks and collapse diagnostics, our method remains stable as the number and heterogeneity of merged models increase, consistently outperforming prior baselines.
Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng · Mar 5, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MPCEval: A Benchmark for Multi-Party Conversation Generation
Minxing Zhang, Yi Yang, Zhuofan Jia, Xuan Yang, Jian Pei · Mar 5, 2026 · Citations: 0

Multi-party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck.
When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger
Amirabbas Afzali, Myeongho Jeon, Maria Brbic · Mar 5, 2026 · Citations: 0

Pairwise Preference

Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization…
Replaying pre-training data improves fine-tuning
Suhas Kotha, Percy Liang · Mar 5, 2026 · Citations: 0

Web Browsing

We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%.
VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters
Jiaxin Fan, Wenpo Song · Mar 5, 2026 · Citations: 0

By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling.
TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0

Demonstrations · Web Browsing

The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services
Jinwen Chen, Shuai Gong, Shiwen Zhang, Zheng Zhang, Yachao Zhao · Mar 5, 2026 · Citations: 0

Pairwise Preference

While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency.
Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition
Mengze Hong, Yi Gu, Di Jiang, Hanlin Gu, Chen Jason Zhang · Mar 5, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis
Stavros Gazetas, Giorgos Filandrianos, Maria Lymperaiou, Paraskevi Tzouveli, Athanasios Voulodimos · Mar 5, 2026 · Citations: 0

Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.
AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection
Panagiotis Alexios Spanakis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou · Mar 5, 2026 · Citations: 0

This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement.
Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
Hiroki Fukui · Mar 5, 2026 · Citations: 0

Multi Agent

We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface…
Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research
Arina Kostina, Marios Dikaiakos, Alejandro Porcel, Tassos Stassopoulos · Mar 5, 2026 · Citations: 0

In our work we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework.
Free Lunch for Pass@k? Low Cost Diverse Sampling for Diffusion Language Models
Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish · Mar 5, 2026 · Citations: 0

We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model.
FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications
Yunfan Zhang, Yijie Bei, Jetashree Ravi, Pawel Garbacki · Mar 5, 2026 · Citations: 0

However, existing instruction following benchmarks predominantly evaluate natural language generation constraints that reflect the needs of chat assistants rather than enterprise users.
HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents
Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou · Mar 5, 2026 · Citations: 0

Multi Agent

We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas.
