- From Associations to Activations: Comparing Behavioral and Hidden-State Semantic Geometry in LLMs
Louis Schiekiera, Max Zimmer, Christophe Roux, Sebastian Pokutta, Fritz Günther · Jan 31, 2026
Using representational similarity analysis, we compare behavioral geometries to layerwise hidden-state similarity and benchmark against FastText, BERT, and cross-model consensus.
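The core RSA recipe behind this kind of comparison can be sketched in a few lines (a generic illustration with random stand-in data, not the paper's actual pipeline; the word counts and dimensions are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Hypothetical embeddings for the same word list from two sources,
# e.g. behavioral association norms vs. one hidden layer of an LLM.
rng = np.random.default_rng(0)
behavioral = rng.normal(size=(50, 300))   # 50 words x 300 behavioral dims
hidden = rng.normal(size=(50, 4096))      # 50 words x 4096 hidden-state dims

def rdm(X):
    """Representational dissimilarity matrix, as condensed pairwise
    cosine distances between the row vectors."""
    return pdist(X, metric="cosine")

# RSA score: rank-correlate the two dissimilarity structures.
rho, _ = spearmanr(rdm(behavioral), rdm(hidden))
print(f"RSA (Spearman rho): {rho:.3f}")
```

The same comparison can be repeated per layer to obtain the layerwise profile the abstract refers to.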
- Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
Xiang Zheng, Weiqi Zhai, Wei Wang, Boyu Yang, Wenbo Li · Jan 31, 2026
Multi Agent
Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about such benchmarks' ability to diagnose genuine reasoning competence.
- KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026
Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation.
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026
Long Horizon
While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
- Indic-TunedLens: Interpreting Multilingual Models in Indian Languages
Mihir Panchal, Deeksha Varshney, Mamta, Asif Ekbal · Jan 29, 2026
We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low-resource languages.
- INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection
Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya · Jan 28, 2026
Web Browsing
We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification.
- Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning
Magnus Boman · Jan 27, 2026
Large language models (LLMs) exhibit failure modes on seemingly trivial tasks.
- One Token Is Enough: Improving Diffusion Language Models with a Sink Token
Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Yao Hu · Jan 27, 2026
Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance.
- FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning
Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar · Jan 26, 2026
Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-OSS-20B), outperforming state-of-the-art methods such as TALE and ThinkLess.
- Flatter Tokens are More Valuable for Speculative Draft Model Training
Jiaming Fan, Daming Cao, Xiangzhong Luo, Jiale Fu, Chonghan Liu · Jan 26, 2026
Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset.
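The draft-then-verify loop that SD builds on can be sketched generically (toy stand-ins for both models; all names and the acceptance rate are illustrative assumptions, not this paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)

def draft_model(prefix, k):
    # Stand-in for a cheap draft model proposing k candidate tokens.
    return [int(rng.integers(0, 10)) for _ in range(k)]

def target_accepts(prefix, token):
    # Stand-in for the target model verifying one drafted token;
    # a fixed 70% acceptance rate is assumed purely for illustration.
    return rng.random() < 0.7

def speculative_step(prefix, k=4):
    """One speculative-decoding step: the draft proposes k tokens and
    the target verifies them left to right, keeping the accepted run."""
    accepted = []
    for tok in draft_model(prefix, k):
        if target_accepts(prefix + accepted, tok):
            accepted.append(tok)
        else:
            break  # first rejection ends the step; the target resamples here
    return accepted

out = speculative_step(prefix=[101, 102])
```

The draft model's quality determines how long the accepted runs are, which is why its training data matters for end-to-end speedup.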
- Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026
Pairwise Preference · Long Horizon
Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which align poorly with long-horizon task success.
- Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints
Ha Na Cho, Sairam Sutari, Alexander Lopez, Hansen Bow, Kai Zheng · Jan 24, 2026
Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety.
- Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis
Gaurav Negi, MA Waskow, John McCrae, Paul Buitelaar · Jan 23, 2026
Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications.
- PhysE-Inv: A Physics-Encoded Inverse Modeling approach for Arctic Snow Depth Prediction
Akila Sampath, Vandana Janeja, Jianwu Wang · Jan 23, 2026
The accurate estimation of Arctic snow depth remains a critical time-varying inverse problem due to the scarcity of associated sea ice parameters.
- Between Search and Platform: ChatGPT Under the DSA
Toni Lorente, Kathrin Gardhouse · Jan 22, 2026
Web Browsing
This article examines the applicability of the Digital Services Act (DSA) to ChatGPT, arguing that it should be classified as a hybrid of the two types of hosting services: online search engines and platforms.
- ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models
Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen · Jan 22, 2026
Large language model (LLM) benchmarks tell us when models fail, but not why they fail.
- RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Zhitao He, Zongwei Lyu, Yi R Fung · Jan 22, 2026
Pairwise Preference · Critique Edit
In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion strategies…
- APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026
Rubric Rating · Expert Verification · Long Horizon
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers.
- Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum
Víctor Yeste, Paolo Rosso · Jan 20, 2026
We study sentence-level detection of the 19 human values in the refined Schwartz continuum in about 74k English sentences from news and political manifestos (ValueEval'24 corpus).
- Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring
Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang · Jan 20, 2026
While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency.
- When LLMs Imagine People: A Human-Centered Persona Brainstorm Audit for Bias and Fairness in Creative Applications
Hongliu Cao, Eoin Thomas, Rodrigo Acuna Agost · Jan 19, 2026
Existing methods rely on constrained tasks and fixed benchmarks, leaving open-ended creative outputs unexamined.
- Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu · Jan 19, 2026
Multi Agent
Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
- Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026
Long Horizon
Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
- Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes
Abdullah Al Monsur, Nitesh Vamshi Bommisetty, Gene Louis Kim · Jan 17, 2026
The current state of event detection research has two notable recurring limitations that we investigate in this study.
- Generating metamers of human scene understanding
Ritik Raina, Abe Leite, Alexandros Graikos, Seoyoung Ahn, Dimitris Samaras · Jan 16, 2026
Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene.
- Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026
Long Horizon
The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning…
- AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers
Prachuryya Kaushik, Ashish Anand · Jan 15, 2026
We introduce **AWED-FiNER**, an open-source collection of agentic tools, web applications, and 53 state-of-the-art expert models that provide Fine-grained Named Entity Recognition (FgNER) solutions across 36 languages spoken by more than…
- Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa · Jan 15, 2026
We share our models, data, and evaluations at AlignmentPretraining.ai.
- Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG
David Samuel Setiawan, Raphaël Merx, Jey Han Lau · Jan 15, 2026
Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026
Pairwise Preference · Long Horizon
Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied…
- CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Galactic Archaeology and Scientific Discovery
Lorenzo Monti, Tatiana Muraveva, Brian Sheridan, Davide Massari, Alessia Garofalo · Jan 14, 2026
In data-driven scientific discovery, a challenge lies in classifying well-characterized phenomena while identifying novel anomalies.
- CAST: Character-and-Scene Episodic Memory for Agents
Kexin Ma, Bojun Li, Yuhua Tang, Liting Sun, Ruochun Jin · Jan 14, 2026
Episodic memory is a central component of human memory, which refers to the ability to recall coherent events grounded in who, when, and where.
- A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit · Jan 13, 2026
Pairwise Preference
The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
- FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao · Jan 12, 2026
To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry.
- VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding
Haorui Yu, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026
Critique Edit
We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception.
- Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models
Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026
Rubric Rating · Critique Edit
Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
- Reward Modeling from Natural Language Human Feedback
Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang · Jan 12, 2026
Pairwise Preference · Critique Edit
Reinforcement Learning with Verifiable Rewards (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs).
- Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching
Stephen Gadd · Jan 11, 2026
Linking names across historical sources, languages, and writing systems remains a fundamental challenge in digital humanities and geographic information retrieval.
- Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective
Feilong Liu · Jan 9, 2026
Mixture-of-Experts (MoE) architectures are widely used for efficiency and conditional computation, but their effect on the geometry of learned functions and representations remains poorly understood.
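As background, a minimal top-k MoE layer, the kind of conditional routing the soft-clustering view describes, can be sketched as follows (a generic illustration under assumed shapes, not the authors' formulation):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Top-k MoE layer: score experts with a linear gate, route the input
    to the k highest-scoring experts, and mix their outputs with the
    renormalized gate probabilities (a soft cluster assignment)."""
    logits = x @ gate_w                         # (n_experts,) routing scores
    topk = np.argsort(logits)[-k:]              # indices of the k chosen experts
    probs = np.exp(logits[topk] - logits[topk].max())
    probs /= probs.sum()                        # softmax over the selected experts
    return sum(p * experts[i](x) for p, i in zip(probs, topk))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
gate_w = rng.normal(size=(d, n_experts))
# Toy linear experts; real MoE experts are typically small MLPs.
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
y = moe_forward(rng.normal(size=d), gate_w, experts)
```

Viewed this way, the gate partitions input space into overlapping soft clusters, one per expert, which is the geometric object the abstract asks about.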
- HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong · Jan 9, 2026
Pairwise Preference · Rubric Rating
Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
- Neurosymbolic Retrievers for Retrieval-augmented Generation
Yash Saxena, Manas Gaur · Jan 8, 2026
Retrieval Augmented Generation (RAG) has made significant strides in overcoming key limitations of large language models, such as hallucination, lack of contextual grounding, and issues with transparency.
- What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026
Red Team · Tool Use
This paper presents a comprehensive empirical study of the safety alignment capabilities of large language models.
- Stratified Hazard Sampling: Minimal-Variance Event Scheduling for CTMC/DTMC Discrete Diffusion and Flow Models
Seunghwan Jang, SooJean Han · Jan 6, 2026
Uniform-noise discrete diffusion and flow models (e.g., D3PM, SEDD, UDLM, DFM) generate sequences non-autoregressively by iteratively refining randomly initialized vocabulary tokens through multiple context-dependent replacements.
- SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation
Hanqi Jiang, Junhao Chen, Yi Pan, Ling Chen, Weihang You · Jan 6, 2026
While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory.
- Embedding Retrofitting: Data Engineering for better RAG
Anantha Sharma · Jan 6, 2026
Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval.
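One common retrofitting scheme (a Faruqui et al.-style iterative update, offered here as an assumption about what the abstract refers to) pulls each vector toward the average of its knowledge-graph neighbors while staying close to the original embedding:

```python
import numpy as np

def retrofit(embeddings, neighbors, alpha=1.0, beta=1.0, iters=10):
    """Iteratively retrofit word vectors to a knowledge graph.
    alpha weights fidelity to the original vector; beta weights
    agreement with graph neighbors."""
    original = {w: v.copy() for w, v in embeddings.items()}
    new = {w: v.copy() for w, v in embeddings.items()}
    for _ in range(iters):
        for w, nbrs in neighbors.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            # Closed-form update: weighted mean of the original vector
            # and the current neighbor vectors.
            num = alpha * original[w] + beta * sum(new[n] for n in nbrs)
            new[w] = num / (alpha + beta * len(nbrs))
    return new

# Toy example: "car" and "automobile" are linked in the knowledge graph,
# so their retrofitted vectors are pulled toward each other.
emb = {"car": np.array([1.0, 0.0]), "automobile": np.array([0.0, 1.0])}
graph = {"car": ["automobile"], "automobile": ["car"]}
fitted = retrofit(emb, graph)
```

For domain-specific RAG, the idea is that graph-linked terms end up closer in the retrieval index than the pre-trained embeddings alone would place them.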
- The Invisible Hand of AI Libraries Shaping Open Source Projects and Communities
Matteo Esposito, Andrea Janes, Valentina Lenarduzzi, Davide Taibi · Jan 5, 2026
In the early 1980s, Open Source Software emerged as a revolutionary concept amidst the dominance of proprietary software.
- CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving
Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng · Jan 5, 2026
Motivated by this, we present CogFlow, a novel cognitive-inspired three-stage framework that incorporates a knowledge internalization stage, explicitly simulating the hierarchical flow of human reasoning: perception ⇒ internalization ⇒ reasoning.
- ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System
Anantha Sharma · Jan 3, 2026
Pairwise Preference
Detecting distributional drift in high-dimensional data streams presents fundamental challenges: global comparison methods scale poorly, projection-based approaches lose geometric structure, and re-clustering methods suffer from identity instability.
- Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study
Ata Akbari Asanjan, Milad Memarzadeh, Bryan Matthews, Nikunj Oza · Jan 3, 2026
We showcase our findings with two low-dimensional synthetic datasets for data representation, and an aviation safety dataset, called Dashlink, for high-dimensional reconstruction-based anomaly detection.
- Fast-weight Product Key Memory
Tianyu Zhao, Llion Jones · Jan 2, 2026
Notably, in Needle-in-a-Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.