
Weekly Archive

HFEPX Weekly Archive: 2026-W02

Updated from the current HFEPX corpus (Feb 27, 2026). This weekly page groups 10 papers. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Most common annotation unit: Pairwise. Frequent quality control: inter-annotator agreement reported. Frequently cited benchmark: Retrieval. Common metric signal: relevance. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Jan 11, 2026.

Papers: 10 · Last published: Jan 11, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 10 papers for HFEPX Weekly Archive: 2026-W02. Dominant protocol signals include automatic metrics, human evaluation, and LLM-as-judge, with frequent benchmark focus on Retrieval and DROP and metric focus on relevance and agreement. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 40% of hub papers (4/10); use this cohort for benchmark-matched comparisons.
  • DROP appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • relevance is reported in 20% of hub papers (2/10); compare with a secondary metric before ranking methods.
  • agreement is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.
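The "compare with a secondary metric before ranking" guidance above can be sketched as a tie-broken sort. This is a minimal illustration, not HFEPX tooling; the method names and scores are invented, and relevance/agreement stand in for whatever primary and secondary metrics a given cohort reports.

```python
# Hypothetical per-method scores: "relevance" is the primary metric,
# "agreement" the secondary tiebreaker. All values are invented.
methods = [
    {"name": "m1", "relevance": 0.81, "agreement": 0.62},
    {"name": "m2", "relevance": 0.81, "agreement": 0.70},
    {"name": "m3", "relevance": 0.78, "agreement": 0.75},
]

# Sort by primary metric first, then break ties with the secondary metric,
# so equal-relevance methods are not ranked arbitrarily.
ranked = sorted(
    methods,
    key=lambda m: (m["relevance"], m["agreement"]),
    reverse=True,
)
```

With the invented scores, m1 and m2 tie on relevance, and the secondary metric decides their order.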

Researcher Checklist

  • Close the gap on papers with explicit human feedback: coverage is a replication risk (20% vs a 45% target).
  • Close the gap on papers reporting quality controls: coverage is a replication risk (10% vs a 30% target).
  • Maintain strength on papers naming benchmarks/datasets: coverage is strong (50% vs a 35% target).
  • Maintain strength on papers naming evaluation metrics: coverage is strong (50% vs a 35% target).
  • Close the gap on papers with known rater population: coverage is a replication risk (10% vs a 35% target).
  • Close the gap on papers with known annotation unit: coverage is a replication risk (10% vs a 35% target).
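The checklist reduces to comparing observed coverage against a per-signal target. A minimal sketch, assuming the figures above; the dictionary keys and the `coverage_gaps` helper are illustrative names, not part of any HFEPX code.

```python
# Signal -> (observed coverage, target coverage), taken from the checklist.
COVERAGE_TARGETS = {
    "explicit_human_feedback": (0.20, 0.45),
    "quality_controls":        (0.10, 0.30),
    "named_benchmarks":        (0.50, 0.35),
    "named_metrics":           (0.50, 0.35),
    "known_rater_population":  (0.10, 0.35),
    "known_annotation_unit":   (0.10, 0.35),
}

def coverage_gaps(targets):
    """Split signals into replication risks (below target) and strengths,
    keyed by how far each signal sits from its target."""
    risks = {k: round(t - c, 2) for k, (c, t) in targets.items() if c < t}
    strengths = {k: round(c - t, 2) for k, (c, t) in targets.items() if c >= t}
    return risks, strengths

risks, strengths = coverage_gaps(COVERAGE_TARGETS)
```

Under these numbers, four signals land in the risk bucket and two in the strength bucket, matching the checklist.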


Suggested Reading Order

  1. Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  3. HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  4. Neurosymbolic Retrievers for Retrieval-augmented Generation

    Adds automatic metrics for broader coverage within this hub.

  5. What Matters For Safety Alignment?

    Adds automatic metrics with red-team protocols for broader coverage within this hub.

  6. Stratified Hazard Sampling: Minimal-Variance Event Scheduling for CTMC/DTMC Discrete Diffusion and Flow Models

    Adds automatic metrics for broader coverage within this hub.

  7. SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation

    Adds automatic metrics for broader coverage within this hub.

  8. Embedding Retrofitting: Data Engineering for better RAG

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (10% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

human_eval vs llm_as_judge

both=1, left_only=0, right_only=0

1 paper uses both Human Eval and LLM-as-Judge.

human_eval vs automatic_metrics

both=0, left_only=1, right_only=9

0 papers use both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=9

0 papers use both LLM-as-Judge and Automatic Metrics.
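The `both` / `left_only` / `right_only` counts above are plain set arithmetic over per-paper protocol tags. A minimal sketch, assuming each paper carries a set of evaluation-mode tags; the paper IDs below are placeholders, not papers from this hub.

```python
def overlap(left_ids, right_ids):
    """Count papers tagged with the left mode only, the right mode only,
    or both, from two collections of paper IDs."""
    left, right = set(left_ids), set(right_ids)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Placeholder IDs chosen to reproduce the human_eval vs automatic_metrics
# row above (both=0, left_only=1, right_only=9).
human_eval = {"p3"}
automatic_metrics = {"p0", "p1", "p2", "p4", "p5", "p6", "p7", "p8", "p9"}
counts = overlap(human_eval, automatic_metrics)
```

The same helper reproduces any of the three rows once the tag sets are extracted from paper metadata.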

Benchmark Brief

DROP

Coverage: 1 paper (10%)

1 paper (10%) mentions DROP.

Examples: Stratified Hazard Sampling: Minimal-Variance Event Scheduling for CTMC/DTMC Discrete Diffusion and Flow Models

Benchmark Brief

Medieval

Coverage: 1 paper (10%)

1 paper (10%) mentions Medieval.

Examples: Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching

Metric Brief

agreement

Coverage: 1 paper (10%)

1 paper (10%) mentions agreement.

Examples: HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
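When a paper reports "agreement" as a metric signal, a common concrete form is Cohen's kappa between two raters over the same items. A minimal sketch of that computation; the labels and rater judgments below are invented for illustration, and nothing here is taken from the HEART paper itself.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items:
    (observed agreement - expected-by-chance agreement) / (1 - expected)."""
    assert len(rater_a) == len(rater_b), "raters must label the same items"
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Invented pairwise-style judgments from two hypothetical raters.
a = ["good", "good", "bad", "good", "bad", "bad"]
b = ["good", "bad", "bad", "good", "bad", "good"]
kappa = cohens_kappa(a, b)
```

With these invented labels, observed agreement is 4/6 and chance agreement is 0.5, giving kappa of 1/3; perfect agreement would give 1.0.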

Metric Brief

jailbreak success rate

Coverage: 1 paper (10%)

1 paper (10%) mentions jailbreak success rate.

Examples: What Matters For Safety Alignment?
