HFEPX Archive Slice

HFEPX Daily Archive: 2026-03-04

Updated from current HFEPX corpus (Mar 10, 2026). 54 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 10, 2026). 54 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics. Common annotation unit: Pairwise. Frequently cited benchmark: SWE-bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 4, 2026.

Papers: 54 Last published: Mar 4, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

54 / 54 papers are not low-signal flagged.

Benchmark Anchors

5.6%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

9.3%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

5.6% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 7.4% of papers in this hub.
SWE-bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly unspecified rater pools, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Stratify by benchmark (SWE-bench vs AIME) before comparing methods.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning
Mar 4, 2026 · Citations: 0 · Score: 7.5

Eval: Automatic Metrics · Metrics: Accuracy
$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Mar 4, 2026 · Citations: 0 · Score: 7.5

Eval: Automatic Metrics · Metrics: Pass@1
AgentIR: Reasoning-Aware Retrieval for Deep Research Agents
Mar 4, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Mar 4, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Agreement
Why Are Linear RNNs More Parallelizable?
Mar 4, 2026 · Citations: 0 · Score: 4.0

Eval: Not reported · Metrics: Precision
From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models
Mar 4, 2026 · Citations: 0 · Score: 2.5

Eval: Not reported · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning Mar 4, 2026	Automatic Metrics	Semeval	Accuracy	Not reported
$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners Mar 4, 2026	Automatic Metrics	SWE Bench, AIME	Pass@1	Not reported
AgentIR: Reasoning-Aware Retrieval for Deep Research Agents Mar 4, 2026	Automatic Metrics	BrowseComp	Accuracy	Not reported
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development Mar 4, 2026	Automatic Metrics	Not reported	Accuracy, Agreement	Not reported
Why Are Linear RNNs More Parallelizable? Mar 4, 2026	Not reported	Not reported	Precision	Not reported
From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models Mar 4, 2026	Not reported	Not reported	Not reported	Not reported
Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model Mar 4, 2026	Not reported	Not reported	Not reported	Not reported
Optimizing Language Models for Crosslingual Knowledge Consistency Mar 4, 2026	Not reported	Not reported	Not reported	Not reported
Using Vision + Language Models to Predict Item Difficulty Mar 4, 2026	Not reported	Not reported	Not reported	Not reported
Stan: An LLM-based thermodynamics course assistant Mar 4, 2026	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (5.6% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (18.5% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (44.4% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (0% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (3.7% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Annotation unit is under-specified (3.7% coverage).

Suggested Next Analyses

Stratify by benchmark (SWE-bench vs AIME) before comparing methods.
Track metric sensitivity by reporting both accuracy and agreement.

Recommended Queries

Benchmark Slice: SWE-bench Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (4)

Top Metrics

Accuracy (10)
Agreement (3)
F1 (3)
Pass@1 (3)

Top Benchmarks

SWE Bench (2)
AIME (1)
CodeContests (1)
Driftbench (1)

Quality Controls

Papers In This Archive Slice

Optimizing Language Models for Crosslingual Knowledge Consistency
Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández · Mar 4, 2026 · Citations: 0
Using Vision + Language Models to Predict Item Difficulty
Samin Khan · Mar 4, 2026 · Citations: 0
Stan: An LLM-based thermodynamics course assistant
Eric M. Furst, Vasudevan Venkateshwaran · Mar 4, 2026 · Citations: 0
iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics
Preetam Prabhu Srikar Dammu, Arnav Palkhiwala, Tanya Roosta, Chirag Shah · Mar 4, 2026 · Citations: 0
Coordinated Semantic Alignment and Evidence Constraints for Retrieval-Augmented Generation with Large Language Models
Xin Chen, Saili Uday Gadgil, Jiarong Qiu · Mar 4, 2026 · Citations: 0
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu · Mar 4, 2026 · Citations: 0

Pairwise Preference Web Browsing

We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous…
Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang · Mar 4, 2026 · Citations: 0
From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models
Junlong Tong, Zilong Wang, YuJie Ren, Peiran Yin, Hao Wu · Mar 4, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Adaptive Memory Admission Control for LLM Agents
Guilin Zhang, Wei Jiang, Xiejiashan Wang, Aisha Behr, Kai Zhao · Mar 4, 2026 · Citations: 0
Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks
Nathan Kuissi, Suraj Subrahmanyan, Nandan Thakur, Jimmy Lin · Mar 4, 2026 · Citations: 0
AgentIR: Reasoning-Aware Retrieval for Deep Research Agents
Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai · Mar 4, 2026 · Citations: 0

To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research…
TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning
Maximilian von Klinski, Maximilian Schall · Mar 4, 2026 · Citations: 0
$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, Victor Barres · Mar 4, 2026 · Citations: 0
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng · Mar 4, 2026 · Citations: 0
Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
Dacheng Qi, Chenyu Wang, Jingwei Xu, Tianzhe Chu, Zibo Zhao · Mar 4, 2026 · Citations: 0
AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning
Nikolas Karafyllis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou · Mar 4, 2026 · Citations: 0

Pairwise Preference

We present a winning three-stage system for SemEval 2026 Task~12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design optimized through reflective prompt evolution, and post-hoc…
World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings
Elan Barenholtz · Mar 4, 2026 · Citations: 0
$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan · Mar 4, 2026 · Citations: 0

Pairwise Preference

On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
The Company You Keep: How LLMs Respond to Dark Triad Traits
Zeyi Lu, Angelica Henestrosa, Pavel Chizhov, Ivan P. Yamshchikov · Mar 4, 2026 · Citations: 0
Position: Vector Prompt Interfaces Should Be Exposed to Enable Customization of Large Language Models
Liangwei Yang, Shiyu Wang, Haolin Chen, Rithesh Murthy, Ming Zhu · Mar 4, 2026 · Citations: 0
Causality Elicitation from Large Language Models
Takashi Kameyama, Masahiro Kato, Yasuko Hio, Yasushi Takano, Naoto Minakawa · Mar 4, 2026 · Citations: 0
Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory
Zhenting Wang, Huancheng Chen, Jiayun Wang, Wei Wei · Mar 4, 2026 · Citations: 0
Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG
Martin Asenov, Kenza Benkirane, Dan Goldwater, Aneiss Ghodsi · Mar 4, 2026 · Citations: 0
When Do Language Models Endorse Limitations on Human Rights Principles?
Keenan Samway, Nicole Miu Takagi, Rada Mihalcea, Bernhard Schölkopf, Ilias Chalkidis · Mar 4, 2026 · Citations: 0
Code Fingerprints: Disentangled Attribution of LLM-Generated Code
Jiaxun Guo, Ziyuan Yang, Mengyu Sun, Hui Wang, Jingfeng Lu · Mar 4, 2026 · Citations: 0
Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model
Jakub Prejzner · Mar 4, 2026 · Citations: 0

We present Bielik-Q2-Sharp, the first systematic academic evaluation of extreme 2-bit quantization applied to a Polish large language model.
Traces of Social Competence in Large Language Models
Tom Kouwenhoven, Michiel van der Meer, Max van Duijn · Mar 4, 2026 · Citations: 0
VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications
Hung Vu Nguyen, Loan Do, Thanh Ngoc Nguyen, Ushik Shrestha Khwakhali, Thanh Pham · Mar 4, 2026 · Citations: 0
BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning
Tarjei Paule Hage, Markus J. Buehler · Mar 4, 2026 · Citations: 0
FINEST: Improving LLM Responses to Sensitive Topics Through Fine-Grained Evaluation
Juhyun Oh, Nayeon Lee, Chani Jung, Jiho Jin, Junho Myung · Mar 4, 2026 · Citations: 0
Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation
Malik Marmonier, Benoît Sagot, Rachel Bawden · Mar 4, 2026 · Citations: 0
Monitoring Emergent Reward Hacking During Generation via Internal Activations
Patrick Wilhelm, Thorsten Wittkopp, Odej Kao · Mar 4, 2026 · Citations: 0
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Richard Dufour, Benoit Favre · Mar 4, 2026 · Citations: 0
Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects
Ji-Lun Peng, Yun-Nung Chen · Mar 4, 2026 · Citations: 0
From Threat Intelligence to Firewall Rules: Semantic Relations in Hybrid AI Agent and Expert System Architectures
Chiara Bonfanti, Davide Colaiacomo, Luca Cagliero, Cataldo Basile · Mar 4, 2026 · Citations: 0
IROSA: Interactive Robot Skill Adaptation using Natural Language
Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp · Mar 4, 2026 · Citations: 0
CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents
Martin Kostelník, Michal Hradiš, Martin Dočekal · Mar 4, 2026 · Citations: 0
On the Suitability of LLM-Driven Agents for Dark Pattern Audits
Chen Sun, Yash Vekaria, Rishab Nithyanand · Mar 4, 2026 · Citations: 0
Assessing the Effectiveness of LLMs in Delivering Cognitive Behavioral Therapy
Navdeep Singh Bedi, Ana-Maria Bucur, Noriko Kando, Fabio Crestani · Mar 4, 2026 · Citations: 0
Coupling Local Context and Global Semantic Prototypes via a Hierarchical Architecture for Rhetorical Roles Labeling
Anas Belfathi, Nicolas Hernandez, Laura Monceaux, Warren Bonnard, Mary Catherine Lavissiere · Mar 4, 2026 · Citations: 0
Benchmarking Motivational Interviewing Competence of Large Language Models
Aishwariya Jha, Prakrithi Shivaprakash, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand · Mar 4, 2026 · Citations: 0
Semantic Bridging Domains: Pseudo-Source as Test-Time Connector
Xizhong Yang, Huiming Wang, Ning Xu, Mofei Song · Mar 4, 2026 · Citations: 0
In-Context Environments Induce Evaluation-Awareness in Language Models
Maheep Chaudhary · Mar 4, 2026 · Citations: 0
SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, Bing Zhao · Mar 4, 2026 · Citations: 0
T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
Qinsi Wang, Hancheng Ye, Jinhee Kim, Jinghan Ke, Yifei Wang · Mar 4, 2026 · Citations: 0
MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier
Zonglin Yang, Lidong Bing · Mar 4, 2026 · Citations: 0
Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning
Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu · Mar 4, 2026 · Citations: 0
ErrorLLM: Modeling SQL Errors for Text-to-SQL Refinement
Zijin Hong, Hao Chen, Zheng Yuan, Qinggang Zhang, Luyao Zhuang · Mar 4, 2026 · Citations: 0
Order Is Not Layout: Order-to-Space Bias in Image Generation
Yongkang Zhang, Zonglin Zhao, Yuechen Zhang, Fei Ding, Pei Li · Mar 4, 2026 · Citations: 0
CONCUR: Benchmarking LLMs for Concurrent Code Generation
Jue Huang, Tarek Mahmud, Corina Pasareanu, Guowei Yang · Mar 4, 2026 · Citations: 0
MIND: Unified Inquiry and Diagnosis RL with Criteria Grounded Clinical Supports for Psychiatric Consultation
Guoyi Li, Shihao Xu, Jiatong Ma, Yunyun Han, Jianhua Chen · Mar 4, 2026 · Citations: 0
Linguistically Informed Graph Model and Semantic Contrastive Learning for Korean Short Text Classification
JaeGeon Yoo, Byoungwook Kim, Yeongwook Yang, Hong-Jun Jang · Mar 4, 2026 · Citations: 0
A Neural Topic Method Using a Large-Language-Model-in-the-Loop for Business Research
Stephan Ludwig, Peter J. Danaher, Xiaohao Yang · Mar 4, 2026 · Citations: 0
Why Are Linear RNNs More Parallelizable?
William Merrill, Hongjian Jiang, Yanhong Li, Anthony Lin, Ashish Sabharwal · Mar 4, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote