HFEPX Fortnight Archive: 2026-F04

Updated from the current HFEPX corpus (Feb 27, 2026). This fortnight page groups 335 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 22, 2026.

Papers: 335 · Last published: Feb 22, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 335 papers for HFEPX Fortnight Archive: 2026-F04. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation, with frequent benchmark focus on Retrieval and MATH and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

14.6% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering; Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition; PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification; Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Automatic metrics appear in 88.1% of papers in this hub.

Evidence: Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition; PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification; Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations; Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering; Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition; PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification; Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Protocol Takeaways

The most common quality-control signal is rater calibration (2.7% of papers).

Evidence: Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition; Astra: Activation-Space Tail-Eigenvector Low-Rank Adaptation of Large Language Models; PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification; Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Raters are mostly domain experts, and annotation is most often trajectory-level; use this to scope replication staffing.

Evidence: Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition; PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification; Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations; Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift. In the current hub no paper reports both (18 use human_eval only, 1 uses llm_as_judge only), so this comparison needs a wider slice.

Evidence: Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition; PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification; Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations; Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
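
One common way to quantify judge-human agreement is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal illustration, not the method of any paper in this hub; the function name and the pairwise-preference labels are hypothetical. Drift could then be tracked by computing kappa separately per cohort or time window and comparing the values.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(human_freq[c] * judge_freq[c] for c in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: which response ("A" or "B") each rater preferred.
human_labels = ["A", "A", "B", "B", "A", "B", "A", "B"]
judge_labels = ["A", "A", "B", "A", "A", "B", "B", "B"]
kappa = cohens_kappa(human_labels, judge_labels)
print(f"raw agreement = 0.75, kappa = {kappa:.2f}")
```

Raw agreement alone can overstate judge quality when one label dominates, which is why a chance-corrected statistic is usually reported alongside it.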

Benchmark Interpretation

Retrieval appears in 10.4% of hub papers (35/335); use this cohort for benchmark-matched comparisons.
MATH appears in 2.7% of hub papers (9/335); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 22.4% of hub papers (75/335); compare with a secondary metric before ranking methods.
cost is reported in 7.5% of hub papers (25/335); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (14.6% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (4.5% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (24.8% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (46% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (8.7% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (11.3% vs 35% target).
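
The checklist gaps can be restated in percentage points and in approximate paper counts. The sketch below assumes the hub size of 335 stated on this page and copies the coverage and target percentages from the checklist; the function name and dictionary keys are hypothetical, and paper counts are rounded estimates because only percentages are given. The status labels ("replication risk", "usable but incomplete", "strong") are not recomputed, since the page does not state their thresholds.

```python
import math

TOTAL_PAPERS = 335  # hub size stated on this page

# (coverage %, target %) pairs copied from the checklist.
CHECKLIST = {
    "explicit human feedback": (14.6, 45.0),
    "quality controls": (4.5, 30.0),
    "benchmarks/datasets named": (24.8, 35.0),
    "evaluation metrics named": (46.0, 35.0),
    "known rater population": (8.7, 35.0),
    "known annotation unit": (11.3, 35.0),
}


def coverage_gap(coverage_pct, target_pct, total=TOTAL_PAPERS):
    """Return (gap in percentage points, approx. extra papers needed to hit target)."""
    gap_pp = round(target_pct - coverage_pct, 1)
    current = round(coverage_pct / 100 * total)   # estimated papers covered now
    needed = math.ceil(target_pct / 100 * total)  # papers required at target
    return gap_pp, max(0, needed - current)


gaps = {name: coverage_gap(c, t) for name, (c, t) in CHECKLIST.items()}

# Largest gap first; a negative gap means coverage already exceeds the target.
for name, (gap_pp, extra) in sorted(gaps.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: {gap_pp:+.1f} pp to target, ~{extra} more papers")
```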

Known Limitations

Only 4.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Judge vs Human Agreement - Compares papers that evaluate with both human raters and LLM judges.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs llm_as_judge

both=0, human_eval only=18, llm_as_judge only=1

No papers use both human evaluation and an LLM judge.

human_eval vs automatic_metrics

both=2, human_eval only=16, automatic_metrics only=293

2 papers use both human evaluation and automatic metrics.

llm_as_judge vs automatic_metrics

both=0, llm_as_judge only=1, automatic_metrics only=295

No papers use both an LLM judge and automatic metrics.
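
The three comparisons above can be cross-checked against each other: the total for a given mode should be the same in every comparison it appears in. The sketch below is an illustration with a hypothetical helper name, where `left_only` and `right_only` are the counts for the first and second mode named in each comparison. It also derives a Jaccard overlap for each pair.

```python
def overlap_summary(both, left_only, right_only):
    """Derive per-mode totals and Jaccard overlap from a three-cell comparison."""
    union = both + left_only + right_only
    return {
        "left_total": both + left_only,
        "right_total": both + right_only,
        "union": union,
        "jaccard": both / union if union else 0.0,
    }


# Counts copied from the three comparisons on this page.
human_vs_judge = overlap_summary(both=0, left_only=18, right_only=1)
human_vs_auto = overlap_summary(both=2, left_only=16, right_only=293)
judge_vs_auto = overlap_summary(both=0, left_only=1, right_only=295)

# Consistency check: each mode has one total across all comparisons.
human_total = human_vs_judge["left_total"]
judge_total = human_vs_judge["right_total"]
auto_total = human_vs_auto["right_total"]
consistent = (
    human_total == human_vs_auto["left_total"]
    and judge_total == judge_vs_auto["left_total"]
    and auto_total == judge_vs_auto["right_total"]
)

print(f"human_eval={human_total}, llm_as_judge={judge_total}, "
      f"automatic_metrics={auto_total}, consistent={consistent}")
print(f"automatic_metrics share of 335 papers: {auto_total / 335:.1%}")
```

With these counts the automatic-metrics total is 295 papers, which matches the 88.1% share reported earlier on the page.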

Benchmark Brief

Retrieval

Coverage: 35 papers (10.4%) mention Retrieval.

Examples: Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering; Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection; Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs

Benchmark Brief

MATH

Coverage: 9 papers (2.7%) mention MATH.

Examples: Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards; RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models; Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

Benchmark Brief

GSM8K

Coverage: 6 papers (1.8%) mention GSM8K.

Examples: Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models; SPQ: An Ensemble Technique for Large Language Model Compression; TFL: Targeted Bit-Flip Attack on Large Language Model

Metric Brief

accuracy

Coverage: 75 papers (22.4%) mention accuracy.

Examples: Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition; Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations; VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Metric Brief

cost

Coverage: 25 papers (7.5%) mention cost.

Examples: Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content; Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer; Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning

Metric Brief

precision

Coverage: 15 papers (4.5%) mention precision.

Examples: PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification; TriTopic: Tri-Modal Graph-Based Topic Modeling with Iterative Refinement and Archetypes; MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition; PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification; Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja, Yujia Bao · Feb 22, 2026

Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI.
PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
Isun Chehreh, Ebrahim Ansari · Feb 22, 2026

Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification.
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026

Long Horizon

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026

Pairwise Preference · Long Horizon

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection
Raihan Tanvir, Md. Golam Rabiul Alam · Feb 22, 2026

Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives.
Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content
Simon Münker, Nils Schwager, Kai Kugler, Michael Heseltine, Achim Rettinger · Feb 22, 2026

The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift.
TurkicNLP: An NLP Toolkit for Turkic Languages
Sherzod Hakimov · Feb 22, 2026

Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources.
Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk · Feb 22, 2026

The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps).
Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
Francesca Bianco, Derek Shiller · Feb 22, 2026

This work supports a more evidence-driven (a) debate on AI sentience and welfare, and (b) governance when setting policy, auditing standards, and safety safeguards.
Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs
Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide · Feb 22, 2026

Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-s
VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026

Long Horizon

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
A Dataset for Named Entity Recognition and Relation Extraction from Art-historical Image Descriptions
Stefanie Schneider, Miriam Göldl, Julian Stalter, Ricarda Vollmer · Feb 22, 2026

The dataset is released as UIMA XMI Common Analysis Structure (CAS) files with accompanying images and bibliographic metadata, and can be used to benchmark and fine-tune NER and RE systems, including zero- and few-shot setups with Large Lan
AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026

Long Horizon

With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.
How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders
Michael McCoubrey, Angelo Salatino, Francesco Osborne, Enrico Motta · Feb 22, 2026

In recent years, there has been a growing use of generative AI, and large language models (LLMs) in particular, to support both the assessment and generation of scientific work.
Astra: Activation-Space Tail-Eigenvector Low-Rank Adaptation of Large Language Models
Kainan Liu, Yong Zhang, Ning Cheng, Yun Zhu, Yanmeng Wang · Feb 22, 2026

Extensive experiments across natural language understanding (NLU) and natural language generation (NLG) tasks demonstrate that Astra consistently outperforms existing PEFT baselines across 16 benchmarks and even surpasses full fine-tuning (
Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models
Seong Hah Cho, Junyi Li, Anna Leshinskaya · Feb 22, 2026

Among the characteristics of value representation in humans is that they distinguish among value of different kinds.
TriTopic: Tri-Modal Graph-Based Topic Modeling with Iterative Refinement and Archetypes
Roman Egger · Feb 22, 2026

In benchmarks across 20 Newsgroups, BBC News, AG News, and Arxiv, TriTopic achieves the highest NMI on every dataset (mean NMI 0.575 vs.
Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026

Long Horizon

Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su · Feb 22, 2026

Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training.
Uncovering Context Reliance in Unstructured Knowledge Editing
Zisheng Zhou, Mengqi Zhang, Shiguang Wu, Xiaotian Ye, Chi Zhang · Feb 22, 2026

Evaluations show that COIN reduces Context Reliance by 45.2% and outperforms strong baselines by 23.6% in editing success rate, highlighting the vital role of mitigating Context Reliance for robust editing.
Learning to Detect Language Model Training Data via Active Reconstruction
Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min, Hannaneh Hajishirzi · Feb 22, 2026

Detecting LLM training data is generally framed as a membership inference attack (MIA) problem.
Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
Wilson Y. Lee · Feb 22, 2026

Long Horizon

Why do language agents fail on tasks they are capable of solving?
Benchmark Test-Time Scaling of General LLM Agents
Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang · Feb 22, 2026

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests.
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026

Multi Agent

We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026

Pairwise Preference

One annotator pair achieved almost perfect agreement (κ = 0.8743; 93.8% raw agreement), exceeding a number of reported benchmarks for English sarcasm research works.
MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs
Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata · Feb 21, 2026

Changing runtime complexity on cloud and edge devices necessitates elastic large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources.
Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026

Personal AI agents incur substantial cost via repeated LLM calls.
DeepInnovator: Triggering the Innovative Capabilities of LLMs
Tianyu Fan, Fengji Zhang, Yuxiang Zheng, Bei Chen, Xinyao Niu · Feb 21, 2026

The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously g
AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting
Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari · Feb 21, 2026

Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency.
TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning
Yujiao Yang · Feb 21, 2026

Extensive experiments across multiple reasoning benchmarks demonstrate that the proposed framework provides multi-level, verifiable explanations, including executable reasoning structures for individual instances, feasible-region representa
[b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic
Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David Harwath, David R. Mortensen · Feb 21, 2026

Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored.
Hyperbolic Busemann Neural Networks
Ziheng Chen, Bernhard Schölkopf, Nicu Sebe · Feb 21, 2026

Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth.
EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation
Adam Dejl, Jonathan Pearson · Feb 21, 2026

Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains.
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026

Pairwise Preference

We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight
BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models
Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat · Feb 21, 2026

We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG).
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026

Red Team

Defending LLMs against adversarial jailbreak attacks remains an open challenge.
ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan · Feb 21, 2026

We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9).
The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol
Andreas Schlapbach · Feb 21, 2026

This paper establishes a fundamental convergence: Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP) represent two manifestations of a unified paradigm for deterministic, auditable LLM-agent interaction.
Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem
Lichang Song, Ting Long, Yi Chang · Feb 21, 2026

Multi Agent

To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-ma
ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models
Zefang Liu, Chenyang Zhu, Sangwoo Cho, Shi-Xiong Zhang · Feb 21, 2026

Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026

Long Horizon

LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift
Stephen Russell · Feb 21, 2026

Long Horizon

Most semantic drift studies report multiple signals (e.g., embedding displacement, neighbor changes, distributional divergence, and recursive trajectory instability) without a shared explanatory theory that relates them.
Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM
Md Badsha Biswas, Ozlem Uzuner · Feb 21, 2026

Through extensive evaluation on four benchmark datasets with five LLMs, we show that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning.
From Trial by Fire To Sleep Like a Baby: A Lexicon of Anxiety Associations for 20k English Multiword Expressions
Saif M. Mohammad · Feb 21, 2026

Anxiety is the unease about a possible future negative outcome.
Spilled Energy in Large Language Models
Adrian Robert Minut, Hazem Dewidar, Iacopo Masi · Feb 21, 2026

Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalizati
PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
Nina Hosseini-Kivanani · Feb 20, 2026

Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings.
DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning
Fangyuan Xu, Sihao Chen, Zinan Lin, Taiwei Shi, Sydney Graham · Feb 20, 2026

Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data, where data owners cannot provide eyes-on access to individual examples.
Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer · Feb 20, 2026

Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever.
Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026

Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.
Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026

Pairwise Preference · Long Horizon

When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning
Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha · Feb 20, 2026

Existing cultural benchmarks are (i) manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured.
RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026

Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026

Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
Subgroups of $U(d)$ Induce Natural RNN and Transformer Architectures
Joshua Nunley · Feb 20, 2026

This paper presents a direct framework for sequence models with hidden states on closed subgroups of U(d).
Validating Political Position Predictions of Arguments
Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026

Pairwise Preference

Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026

Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
On the "Induction Bias" in Sequence Models
M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic · Feb 20, 2026

Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning
Tao Wu, Adam Kapelner · Feb 20, 2026

In summary, we demonstrate that a modern embedding model on neural network architecture, when guided by human supervision, results in a low-cost large supply of near-perfect contexts for teaching vocabulary for a variety of target words.
PsihoRo: Depression and Anxiety Romanian Text Corpus
Alexandra Ciobotaru, Ana-Maria Bucur, Liviu P. Dinu · Feb 20, 2026

Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health.
VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig · Feb 20, 2026

However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries.
