HFEPX Weekly Archive: 2026-W01

Updated from current HFEPX corpus (Feb 27, 2026). 6 papers are grouped on this page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: Needle In A Haystack. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Jan 3, 2026.

Papers: 6 · Last published: Jan 3, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 6 papers for HFEPX Weekly Archive: 2026-W01. The dominant protocol signal is automatic metrics; benchmark focus falls on Needle In A Haystack and Retrieval, and metric focus on cost and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

16.7% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System, Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study, Fast-weight Product Key Memory, RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment

Automatic metrics appear in 100% of papers in this hub.

Evidence: ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System, Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study, Fast-weight Product Key Memory, RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment

Needle In A Haystack is the benchmark anchor for cross-paper comparisons on this page, although only 1 of 6 papers reports it.

Evidence: ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System, Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study, Fast-weight Product Key Memory, RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment

Protocol Takeaways

The most common quality-control signal is rater calibration (16.7% of papers).

Evidence: WISE: Web Information Satire and Fakeness Evaluation, ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System, Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study, Fast-weight Product Key Memory

Raters are mostly domain experts, and the most common annotation unit is pairwise; use this to scope replication staffing.

Evidence: Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss, ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System, Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study, Fast-weight Product Key Memory

Stratify by benchmark (Needle In A Haystack vs Retrieval) before comparing methods.

Evidence: ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System, Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study, Fast-weight Product Key Memory, RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
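
The stratification advice can be sketched in a few lines of Python. This is only an illustration: the method names (method_a, method_b), the scores, and the helper rank_within_benchmark are hypothetical placeholders, not results or code from any paper in this hub. The point is that methods are ranked inside each benchmark cohort and never across cohorts.

```python
from collections import defaultdict

# Hypothetical placeholder records: method names and scores are invented
# for illustration and do not come from any paper listed on this page.
records = [
    {"method": "method_a", "benchmark": "Needle In A Haystack", "score": 0.8},
    {"method": "method_b", "benchmark": "Needle In A Haystack", "score": 0.6},
    {"method": "method_a", "benchmark": "Retrieval", "score": 0.5},
    {"method": "method_b", "benchmark": "Retrieval", "score": 0.7},
]

def rank_within_benchmark(rows):
    """Group rows by benchmark, then rank methods by score inside each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["benchmark"]].append(row)
    return {
        benchmark: [r["method"] for r in sorted(group, key=lambda r: r["score"], reverse=True)]
        for benchmark, group in groups.items()
    }

rankings = rank_within_benchmark(records)
for benchmark, order in rankings.items():
    print(f"{benchmark}: {' > '.join(order)}")
```

With these placeholder scores the two cohorts rank the methods in opposite orders, which is the situation a pooled comparison would hide.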

Benchmark Interpretation

Needle In A Haystack appears in 16.7% of hub papers (1/6); use this cohort for benchmark-matched comparisons.
Retrieval appears in 16.7% of hub papers (1/6); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Cost is reported in 33.3% of hub papers (2/6); compare with a secondary metric before ranking methods.
Accuracy is reported in 16.7% of hub papers (1/6); compare with a secondary metric before ranking methods.
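
As a rough sketch of how such coverage figures can be derived, the Python below counts metric tags per paper and converts the counts to percentages of the 6-paper hub. The tag sets are taken from the Metric Briefs on this page (papers with no listed metric get an empty set, and paper titles are shortened); the function name coverage and the data layout are assumptions made for illustration, not the HFEPX pipeline.

```python
from collections import Counter

# Metric tags per paper, limited to what the Metric Briefs on this page list.
# Titles are shortened; an empty set means no metric is listed for that paper.
paper_metrics = {
    "ARGUS": set(),
    "Improving Variational Autoencoder": set(),
    "Fast-weight Product Key Memory": {"cost"},
    "RAIR": set(),
    "WISE": {"accuracy", "auc"},
    "Coupling Experts and Routers": {"cost"},
}

def coverage(tags_by_paper):
    """Return {tag: (paper_count, percent_of_hub)} rounded to one decimal."""
    total = len(tags_by_paper)
    counts = Counter(tag for tags in tags_by_paper.values() for tag in tags)
    return {tag: (n, round(100 * n / total, 1)) for tag, n in counts.items()}

metric_coverage = coverage(paper_metrics)
for tag, (n, pct) in sorted(metric_coverage.items()):
    print(f"{tag}: {n}/{len(paper_metrics)} papers ({pct}%)")
```

The same helper applies unchanged to benchmark tags, since it only counts how many papers carry each tag.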

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (16.7% vs 45% target).
Close the gap on papers reporting quality controls: coverage is a replication risk (16.7% vs 30% target).
Close the gap on papers naming benchmarks/datasets: coverage is a replication risk (16.7% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (66.7% vs 35% target).
Close the gap on papers with a known rater population: coverage is a replication risk (16.7% vs 35% target).
Close the gap on papers with a known annotation unit: coverage is a replication risk (16.7% vs 35% target).
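
The checklist logic can be reproduced with a short Python sketch. The coverage and target percentages are copied from the checklist; the rule that coverage at or above target counts as "strong" is an assumption inferred from the six items, and the names checklist, status, and gaps are illustrative only.

```python
# (coverage %, target %) pairs copied from the checklist on this page.
checklist = {
    "explicit human feedback": (16.7, 45.0),
    "quality controls": (16.7, 30.0),
    "benchmarks/datasets": (16.7, 35.0),
    "evaluation metrics": (66.7, 35.0),
    "known rater population": (16.7, 35.0),
    "known annotation unit": (16.7, 35.0),
}

def status(coverage_pct, target_pct):
    # Assumed rule: meeting or exceeding the target is "strong",
    # anything below it is flagged as a replication risk.
    return "strong" if coverage_pct >= target_pct else "replication risk"

report = {name: status(c, t) for name, (c, t) in checklist.items()}

# Shortfall in percentage points for every item below target, largest first.
gaps = sorted(
    ((round(t - c, 1), name) for name, (c, t) in checklist.items() if c < t),
    reverse=True,
)

for name, label in report.items():
    print(f"{name}: {label}")
print("largest gap:", gaps[0])
```

Sorting by shortfall gives one possible way to prioritize: under these numbers, explicit human feedback is furthest from its target.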

Known Limitations

Only 16.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (16.7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Needle In A Haystack - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: cost - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Benchmark Brief

Needle In A Haystack

Coverage: 1 paper (16.7%)

Examples: Fast-weight Product Key Memory

Benchmark Brief

Retrieval

Coverage: 1 paper (16.7%)

Examples: Fast-weight Product Key Memory

Metric Brief

cost

Coverage: 2 papers (33.3%)

Examples: Fast-weight Product Key Memory, Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Metric Brief

accuracy

Coverage: 1 paper (16.7%)

Examples: WISE: Web Information Satire and Fakeness Evaluation

Metric Brief

auc

Coverage: 1 paper (16.7%)

Examples: WISE: Web Information Satire and Fakeness Evaluation

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System, Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study, Fast-weight Product Key Memory

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System
Anantha Sharma · Jan 3, 2026

Pairwise Preference

Detecting distributional drift in high-dimensional data streams presents fundamental challenges: global comparison methods scale poorly, projection-based approaches lose geometric structure, and re-clustering methods suffer from identity in

Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study
Ata Akbari Asanjan, Milad Memarzadeh, Bryan Matthews, Nikunj Oza · Jan 3, 2026

We showcase our findings with two low-dimensional synthetic datasets for data representation, and an aviation safety dataset, called Dashlink, for high-dimensional reconstruction-based anomaly detection.

Fast-weight Product Key Memory
Tianyu Zhao, Llion Jones · Jan 2, 2026

Notably, in Needle-in-a-Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.

RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang · Dec 31, 2025

While large language models (LLMs) have shown significant results on relevance task, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics acr

WISE: Web Information Satire and Fakeness Evaluation
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury · Dec 30, 2025

This study develops WISE (Web Information Satire and Fakeness Evaluation) framework which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as eith

Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao · Dec 29, 2025

Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance.
