

HFEPX Fortnight Archive: 2026-F02

Updated from the current HFEPX corpus (Feb 27, 2026). This page groups 27 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Most common annotation unit: Multi Dim Rubric. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: f1. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new evaluation experiments. The newest paper in this set is from Jan 24, 2026.

Papers: 27 · Last published: Jan 24, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 27 papers for HFEPX Fortnight Archive: 2026-F02. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation; benchmark focus falls most often on Retrieval and DocVQA, and metric focus on f1 and latency. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 7.4% of hub papers (2/27); use this cohort for benchmark-matched comparisons.
  • DocVQA appears in 3.7% of hub papers (1/27); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • f1 is reported in 11.1% of hub papers (3/27); compare with a secondary metric before ranking methods.
  • latency is reported in 7.4% of hub papers (2/27); compare with a secondary metric before ranking methods.
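The percentages in these bullets are simple shares of the 27-paper cohort, rounded to one decimal. A quick sanity check (counts taken from the bullets above; the helper name is an assumption, not part of the hub):

```python
def share(count: int, total: int = 27) -> float:
    """Percentage of hub papers, rounded to one decimal as reported above."""
    return round(100 * count / total, 1)

print(share(2))  # 7.4  (Retrieval, latency)
print(share(1))  # 3.7  (DocVQA)
print(share(3))  # 11.1 (f1)
```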

Researcher Checklist

  • Tighten coverage on Papers with explicit human feedback. Coverage is usable but incomplete (29.6% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (11.1% vs 30% target).
  • Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (37% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (37% vs 35% target).
  • Tighten coverage on Papers with known rater population. Coverage is usable but incomplete (22.2% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (14.8% vs 35% target).
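The checklist's three coverage labels follow a consistent pattern across the figures above. One plausible rule, assuming the cutoffs are the target and half the target (the band names are the page's own wording, but the half-target threshold is inferred, not documented):

```python
def coverage_band(coverage_pct: float, target_pct: float) -> str:
    """Classify coverage against a target, mirroring the checklist wording.

    Assumed rule (inferred from the figures above):
      - at or above the target      -> "strong"
      - at or above half the target -> "usable but incomplete"
      - below half the target       -> "replication risk"
    """
    if coverage_pct >= target_pct:
        return "strong"
    if coverage_pct >= target_pct / 2:
        return "usable but incomplete"
    return "replication risk"

# Spot-checks against the checklist figures:
print(coverage_band(29.6, 45))  # usable but incomplete
print(coverage_band(11.1, 30))  # replication risk
print(coverage_band(37.0, 35))  # strong
```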


Suggested Reading Order

  1. Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

    High citation traction makes this a useful baseline for method and protocol context.

  3. Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints

    High citation traction makes this a useful baseline for method and protocol context.

  4. Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

    High citation traction makes this a useful baseline for method and protocol context.

  5. RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind

    Include a human-eval paper to anchor calibration against automated judge settings.

  6. PhysE-Inv: A Physics-Encoded Inverse Modeling approach for Arctic Snow Depth Prediction

    Adds automatic metrics for broader coverage within this hub.

  7. Between Search and Platform: ChatGPT Under the DSA

    Adds automatic metrics for broader coverage within this hub.

  8. ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Only 11.1% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (22.2% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

human_eval vs automatic_metrics

both=0, left_only=2, right_only=21

No papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=21, right_only=4

No papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=4, right_only=2

No papers use both Simulation Env and Human Eval.
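The both/left_only/right_only counts above are a plain set comparison over each cohort's paper IDs. A minimal sketch (the paper IDs are hypothetical; only the cohort sizes match the human_eval vs automatic_metrics row):

```python
def compare_cohorts(left: set, right: set) -> dict:
    """Count cohort overlap the way the utility links above report it."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Hypothetical paper IDs sized to match human_eval (2) vs automatic_metrics (21).
human_eval = {"h1", "h2"}
automatic_metrics = {f"a{i}" for i in range(1, 22)}
print(compare_cohorts(human_eval, automatic_metrics))
# {'both': 0, 'left_only': 2, 'right_only': 21}
```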

Benchmark Brief

DocVQA

Coverage: 1 paper (3.7%)

Examples: Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

Benchmark Brief

GAIA

Coverage: 1 paper (3.7%)

Examples: CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Galactic Archaeology and Scientific Discovery

Papers Published On This Date
