HFEPX Fortnight Archive: 2025-F21

Updated from current HFEPX corpus (Feb 27, 2026). This fortnight page groups 32 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Most common annotation unit: Ranking. Most frequent quality control: Calibration. Most frequently cited benchmark: Retrieval. Most common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Oct 18, 2025.

Papers: 32 · Last published: Oct 18, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 32 papers for HFEPX Fortnight Archive: 2025-F21. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation, with frequent benchmark focus on Retrieval and AlpacaEval and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

15.6% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing, FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution, MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics, Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media

Automatic metrics appear in 93.8% of papers in this hub.

Evidence: FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution, MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics, LUMI: Unsupervised Intent Clustering with Multiple Pseudo-Labels, Understanding the Ability of LLMs to Handle Character-Level Perturbation

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning, Embedding-Based Context-Aware Reranker, PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation, FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution

Protocol Takeaways

The most common quality-control signal is rater calibration (3.1% of papers).

Evidence: Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery, FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution, MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics, Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media

Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution, MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics, Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media, LUMI: Unsupervised Intent Clustering with Multiple Pseudo-Labels

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution, MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics, Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media, LUMI: Unsupervised Intent Clustering with Multiple Pseudo-Labels

Benchmark Interpretation

Retrieval appears in 12.5% of hub papers (4/32); use this cohort for benchmark-matched comparisons.
AlpacaEval appears in 3.1% of hub papers (1/32); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 25% of hub papers (8/32); compare with a secondary metric before ranking methods.
cost is reported in 6.3% of hub papers (2/32); compare with a secondary metric before ranking methods.
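
The coverage figures in the benchmark and metric lines above are plain ratios over the 32-paper hub. A minimal sketch of that arithmetic follows. The function name `coverage_pct` is hypothetical, and the round-half-up rule is an assumption inferred from the page showing 2/32 as 6.3% rather than 6.2%; the page does not state how it rounds.

```python
from decimal import Decimal, ROUND_HALF_UP


def coverage_pct(count: int, total: int) -> str:
    """Format count/total as a percentage with at most one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    pct = Decimal(count) * 100 / Decimal(total)
    # Assumption: exact halves round up (6.25 -> 6.3), matching this page's figures.
    rounded = pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    text = format(rounded, "f")
    # Drop a trailing ".0" so 8/32 prints as "25%" rather than "25.0%".
    if text.endswith(".0"):
        text = text[:-2]
    return f"{text}%"


HUB_SIZE = 32  # papers in this hub, per the page header

# Counts copied from the benchmark and metric lines above.
for label, count in [("Retrieval", 4), ("AlpacaEval", 1), ("accuracy", 8), ("cost", 2)]:
    print(f"{label}: {coverage_pct(count, HUB_SIZE)} ({count}/{HUB_SIZE})")
```

`Decimal` is used instead of the built-in `round` because `round` sends exact halves to the nearest even digit, so `round(6.25, 1)` yields 6.2 and would not reproduce the 6.3% shown for cost.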

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (15.6% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (3.1% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (28.1% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (53.1% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (6.3% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (15.6% vs 35% target).
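
The checklist labels above compare each coverage figure with a target. The sketch below shows one way such labels could be derived. The page does not state its thresholds: the `usable_ratio` cut-off of 0.6 is a hypothetical value chosen only because it reproduces the six labels listed here, and `coverage_status` is an illustrative name, not part of HFEPX.

```python
def coverage_status(coverage: float, target: float, usable_ratio: float = 0.6) -> str:
    """Label a coverage percentage against its target percentage.

    usable_ratio is a hypothetical cut-off, not a value stated by the page.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    if coverage >= target:
        return "strong"
    if coverage >= usable_ratio * target:
        return "usable but incomplete"
    return "replication risk"


# (coverage %, target %) pairs copied from the checklist above.
checklist = {
    "explicit human feedback": (15.6, 45),
    "quality controls": (3.1, 30),
    "benchmarks/datasets": (28.1, 35),
    "evaluation metrics": (53.1, 35),
    "known rater population": (6.3, 35),
    "known annotation unit": (15.6, 35),
}

for name, (coverage, target) in checklist.items():
    status = coverage_status(coverage, target)
    print(f"{name}: {status} ({coverage}% vs {target}% target)")
```

Any cut-off between roughly 0.45 and 0.80 of the target would give the same labels for these six rows, so the exact value should be treated as unknown.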

Known Limitations

Only 3.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=0, right_only=29

1 paper uses both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=1, left_only=29, right_only=2

1 paper uses both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=3, right_only=1

0 papers use both Simulation Env and Human Eval.
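
The `both`, `left_only`, and `right_only` counts above are set intersections and differences over the papers tagged with each evaluation mode. A minimal sketch, assuming each mode is held as a set of paper IDs: the IDs `p0` to `p31` are hypothetical placeholders chosen only so the set sizes match this page, and `overlap` is an illustrative helper name.

```python
def overlap(left: set[str], right: set[str]) -> dict[str, int]:
    """Count papers in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical paper IDs; only the set sizes and overlaps mirror the page.
automatic_metrics = {f"p{i}" for i in range(30)}  # p0..p29
human_eval = {"p0"}
simulation_env = {"p29", "p30", "p31"}

print("human_eval vs automatic_metrics:", overlap(human_eval, automatic_metrics))
print("automatic_metrics vs simulation_env:", overlap(automatic_metrics, simulation_env))
print("simulation_env vs human_eval:", overlap(simulation_env, human_eval))
```

The three comparisons are mutually consistent: 30 papers with automatic metrics plus 2 simulation-only papers account for all 32 papers in the hub.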

Benchmark Brief

Retrieval

Coverage: 4 papers (12.5%)

4 papers (12.5%) mention Retrieval.

Examples: MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning, Embedding-Based Context-Aware Reranker, PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation

Benchmark Brief

AlpacaEval

Coverage: 1 paper (3.1%)

1 paper (3.1%) mentions AlpacaEval.

Examples: Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty

Benchmark Brief

Arena-Hard

Coverage: 1 paper (3.1%)

1 paper (3.1%) mentions Arena-Hard.

Examples: Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty

Metric Brief

accuracy

Coverage: 8 papers (25%)

8 papers (25%) mention accuracy.

Examples: MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics, Embedding-Based Context-Aware Reranker, Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

Metric Brief

cost

Coverage: 2 papers (6.3%)

2 papers (6.3%) mention cost.

Examples: FML-bench: Benchmarking Machine Learning Agents for Scientific Research, EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science

Metric Brief

f1

Coverage: 2 papers (6.3%)

2 papers (6.3%) mention f1.

Examples: PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation, Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution, MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics, Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni · Oct 18, 2025

Human communication heavily relies on laconism and inferential pragmatics, allowing listeners to successfully reconstruct rich meaning from sparse, telegraphic speech.
MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics
Qinxuan Wang, Chuang Wang, Mingyu Zhang, Jingwei Sun, Peipei Yang · Oct 17, 2025

We evaluate MNO on diverse benchmarks, covering steady-state and unsteady flow scenarios with up to 300k points.
Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
Soorya Ram Shimgekar, Ruining Zhao, Agam Goyal, Violeta J. Rodriguez, Paul A. Bloom · Oct 16, 2025

On social media, several individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly.
LUMI: Unsupervised Intent Clustering with Multiple Pseudo-Labels
I-Fan Lin, Faegheh Hasibi, Suzan Verberne · Oct 16, 2025

Our evaluation on four benchmark sets shows that our approach achieves competitive results, better than recent state-of-the-art baselines, while avoiding the need to estimate the number of clusters during embedding refinement, as is require
Understanding the Ability of LLMs to Handle Character-Level Perturbation
Anyuan Zhuo, Xuefei Ning, Ningyuan Li, Jingyi Zhu, Yu Wang · Oct 16, 2025

Surprisingly, even under severe perturbation, such as shuffling nearly all words character-wise to produce text that is almost unreadable to humans, or inserting invisible characters which are several times more than the visible ones as noi
Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Marián Šimko · Oct 15, 2025

Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources.
Closing the Gap Between Text and Speech Understanding in LLMs
Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu · Oct 15, 2025

Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech d
MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning
Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan · Oct 15, 2025

Comprehensive experiments on multiple temporal QA benchmarks show that MemoTime achieves overall state-of-the-art results, outperforming the strong baseline by up to 24.0%.
Embedding-Based Context-Aware Reranker
Ye Yuan, Mohammad Amin Shabani, Siqi Liu · Oct 15, 2025

We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.
Toward LLM-Supported Automated Assessment of Critical Thinking Subskills
Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg · Oct 14, 2025

Rubric Rating

As the world becomes increasingly saturated with AI-generated content, disinformation, and algorithmic persuasion, critical thinking - the capacity to evaluate evidence, detect unreliable claims, and exercise independent judgment - is becom
PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation
Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu · Oct 14, 2025

Long Horizon

Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining s
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu · Oct 14, 2025

Pairwise Preference

Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference.
FML-bench: Benchmarking Machine Learning Agents for Scientific Research
Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen · Oct 12, 2025

Large language models (LLMs) have sparked growing interest in machine learning research agents that can autonomously propose ideas and conduct experiments.
Mapping Semantic & Syntactic Relationships with Geometric Rotation
Michael Freenor, Lauren Alvarez · Oct 10, 2025

Demonstrations

Understanding how language and embedding models encode semantic relationships is fundamental to model interpretability.
The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf · Oct 10, 2025

This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs.
Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery
Antonio Martínez-Ibarra, Aurora González-Vidal, Adrián Cánovas-Rodríguez, Antonio F. Skarmeta · Oct 10, 2025

The Mar Menor, Europe's largest hypersaline coastal lagoon, located in southeastern Spain, has undergone severe eutrophication crises, with devastating impacts on biodiversity and water quality.
Verifying Chain-of-Thought Reasoning via Its Computational Graph
Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda · Oct 10, 2025

Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails.
Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang · Oct 10, 2025

To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection proces
FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao · Oct 10, 2025

We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings.
Lossless Vocabulary Reduction for Auto-Regressive Language Models
Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba · Oct 9, 2025

Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models.
Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
Shramay Palta, Peter Rankel, Sarah Wiegreffe, Rachel Rudinger · Oct 9, 2025

We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by L
PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing
Anthony Hughes, Vasisht Duddu, N. Asokan, Nikolaos Aletras, Ning Ma · Oct 8, 2025

Language models (LMs) may memorize personally identifiable information (PII) from training data, enabling adversaries to extract it during inference.
EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science
Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park · Oct 8, 2025

To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals.
Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Recovery
Didrik Bergström, Deniz Gündüz, Onur Günlü · Oct 8, 2025

We consider image transmission via deep joint source-channel coding (DeepJSCC) over multi-hop additive white Gaussian noise (AWGN) channels by training a DeepJSCC encoder-decoder pair with a pre-trained deep hash distillation (DHD) module t
Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction
Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu · Oct 7, 2025

Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).
Early Multimodal Prediction of Cross-Lingual Meme Virality on Reddit: A Time-Window Analysis
Sedat Dogan, Nina Dethlefs, Debarati Chakraborty · Oct 7, 2025

We benchmark interpretable baselines (XGBoost, MLP) against end-to-end deep models (BERT, InceptionV3, CLIP) across early observation windows from 30 to 420 minutes.
Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies
Chunsan Hong, Seonho An, Min-Soo Kim, Jong Chul Ye · Oct 7, 2025

Empirically, across four benchmarks, our learned policy consistently outperforms max-confidence: for example, on SUDOKU, where unmasking order is critical, it yields a 20.1% gain over random and a 11.2% gain over max-confidence.
AgentDR: Dynamic Recommendation with Implicit Item-Item Relations via LLM-based Agents
Mingdai Yang, Nurendra Choudhary, Jiangshu Du, Edward W. Huang, Philip S. Yu · Oct 7, 2025

Recent agent-based recommendation frameworks aim to simulate user behaviors by incorporating memory mechanisms and prompting strategies, but they struggle with hallucinating non-existent items and full-catalog ranking.
Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty
Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing · Oct 7, 2025

Pairwise Preference

Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs).
SLM-MUX: Orchestrating Small Language Models for Reasoning
Chenyu Wang, Zishen Wan, Hao Kang, Emma Chen, Zhiqiang Xie · Oct 6, 2025

Additional experiments show that the core principle of SLM-MUX extends to open-ended generation tasks (e.g., HumanEval) and benefits other model classes, including frontier LLMs and domain-specific fine-tuned SLMs.
SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin · Oct 6, 2025

Critique Edit

Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control.
Multilingual Routing in Mixture-of-Experts
Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng · Oct 6, 2025

These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs.
