Daily Archive

HFEPX Fortnight Archive: 2025-F20

Updated from the current HFEPX corpus (Feb 27, 2026). This archive page groups 34 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Oct 5, 2025.

Papers: 34 · Last published: Oct 5, 2025

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 34 papers for HFEPX Fortnight Archive: 2025-F20. Dominant protocol signals include automatic metrics, simulation environments, and LLM-as-judge, with frequent benchmark focus on Retrieval and FeatBench and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

14.7% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity; Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning; Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval; BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals

Automatic metrics appear in 88.2% of papers in this hub.

Evidence: PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity; Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval; BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals; Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval; PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity; BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals; Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs

Protocol Takeaways

The most common quality-control signal is rater calibration (5.9% of papers).

Evidence: Incentive-Aligned Multi-Source LLM Summaries; PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity; Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval; BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals

Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts; PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity; Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval; BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Evidence: PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity; Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval; BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals; Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs

Benchmark Interpretation

Retrieval appears in 8.8% of hub papers (3/34); use this cohort for benchmark-matched comparisons.
FeatBench appears in 2.9% of hub papers (1/34); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 20.6% of hub papers (7/34); compare with a secondary metric before ranking methods.
cost is reported in 14.7% of hub papers (5/34); compare with a secondary metric before ranking methods.
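
The shares quoted in the benchmark and metric interpretations are count-over-total percentages for the 34-paper hub. A minimal sketch of that arithmetic, assuming one-decimal rounding; the helper name `coverage_pct` and the `MENTIONS` mapping are hypothetical and only restate the counts given above:

```python
HUB_SIZE = 34  # papers tracked on this page

# Paper counts copied from the interpretation lines above.
MENTIONS = {"Retrieval": 3, "FeatBench": 1, "accuracy": 7, "cost": 5}


def coverage_pct(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


# Assumed rounding rule: one decimal, as displayed on the page.
SHARES = {name: coverage_pct(n, HUB_SIZE) for name, n in MENTIONS.items()}

for name, share in SHARES.items():
    print(f"{name}: {share}% ({MENTIONS[name]}/{HUB_SIZE})")
```

With cohorts this small (1 to 7 papers), a single added or removed paper shifts the share by roughly three percentage points, which is one reason to compare against a secondary metric before ranking methods.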

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (14.7% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (8.8% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (14.7% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (47.1% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (5.9% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (20.6% vs 35% target).
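
To decide which gap to close first, the checklist rows can be ranked by distance from target. This is a sketch only: the rule that coverage at or above target counts as "strong" is an assumption inferred from the six rows, and `status` and `ranked_gaps` are hypothetical helper names.

```python
# (label, coverage %, target %) copied from the checklist above.
CHECKLIST = [
    ("Papers with explicit human feedback", 14.7, 45.0),
    ("Papers reporting quality controls", 8.8, 30.0),
    ("Papers naming benchmarks/datasets", 14.7, 35.0),
    ("Papers naming evaluation metrics", 47.1, 35.0),
    ("Papers with known rater population", 5.9, 35.0),
    ("Papers with known annotation unit", 20.6, 35.0),
]


def status(coverage: float, target: float) -> str:
    """Assumed rule: at or above target is strong, below is a risk."""
    return "strong" if coverage >= target else "replication risk"


def ranked_gaps(rows):
    """List below-target rows as (label, gap in points), largest gap first."""
    gaps = [
        (label, round(target - coverage, 1))
        for label, coverage, target in rows
        if coverage < target
    ]
    return sorted(gaps, key=lambda item: item[1], reverse=True)


for label, gap in ranked_gaps(CHECKLIST):
    print(f"{label}: {gap} points below target")
```

Under these assumptions the largest shortfalls are explicit human feedback and known rater population, both about 30 points below target.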


Known Limitations

Only 8.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (5.9% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

LLM-as-Judge Protocols - Finds judge-based evaluation setups to compare calibration and drift risks.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=30

No papers use both LLM-as-judge and automatic metrics.

automatic_metrics vs simulation_env

both=0, left_only=30, right_only=4

No papers use both automatic metrics and simulation environments.

simulation_env vs llm_as_judge

both=1, left_only=3, right_only=0

1 paper uses both simulation environments and LLM-as-judge.
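
The `both` / `left_only` / `right_only` triples above are set-overlap counts between evaluation-mode tags. The sketch below reproduces the three triples from placeholder paper IDs; the integer IDs and the `overlap` helper are hypothetical and do not identify which papers carry which tag.

```python
def overlap(left: set, right: set) -> dict:
    """Count items in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Placeholder IDs chosen only to match the reported counts:
# 30 papers tagged automatic_metrics, 4 tagged simulation_env,
# and 1 tagged llm_as_judge that is also tagged simulation_env.
automatic_metrics = set(range(30))
simulation_env = {30, 31, 32, 33}
llm_as_judge = {30}

print("llm_as_judge vs automatic_metrics:", overlap(llm_as_judge, automatic_metrics))
print("automatic_metrics vs simulation_env:", overlap(automatic_metrics, simulation_env))
print("simulation_env vs llm_as_judge:", overlap(simulation_env, llm_as_judge))
```

Read together, the three comparisons imply that automatic metrics and simulation environments split the 34 papers without overlap, and that the single LLM-as-judge paper sits inside the simulation group.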

Benchmark Brief

Retrieval

Coverage: 3 papers (8.8%)

Examples: Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval; Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents; ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation

Benchmark Brief

FeatBench

Coverage: 1 paper (2.9%)

Examples: FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation

Benchmark Brief

GPQA

Coverage: 1 paper (2.9%)

Examples: HEART: Emotionally-Driven Test-Time Scaling of Language Models

Metric Brief

accuracy

Coverage: 7 papers (20.6%)

Examples: Incentive-Aligned Multi-Source LLM Summaries; TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models; Uncovering Grounding IDs: How External Cues Shape Multimodal Binding

Metric Brief

cost

Coverage: 5 papers (14.7%)

Examples: BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals; mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations; PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space

Metric Brief

calibration

Coverage: 2 papers (5.9%)

Examples: LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning; CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity; Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval; BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun · Oct 5, 2025

Pairwise Preference

On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture.
Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval
Yohan Lee, Yongwoo Song, Sangyeop Kim · Oct 3, 2025

We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights.
BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
Chenqi Li, Yu Liu, Timothy Denison, Tingting Zhu · Oct 2, 2025

Biosignals offer valuable insights into the physiological states of the human body.
Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs
Shenzhe Zhu, Shu Yang, Michiel A. Bakker, Alex Pentland, Jiaxin Pei · Oct 2, 2025

Studying and fixing these issues requires a comprehensive evaluation at a large scale, yet current practice often relies on LLMs as judges, which show weak alignment with human judgments.
Hearing the Order: Investigating Position Bias in Large Audio-Language Models
Yu-Xiang Lin, Chen-An Li, Sheng-Lun Wei, Po-Chun Chen, Hsin-Hsi Chen · Oct 1, 2025

We demonstrate that no model is immune to this bias through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts.
Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts
Hanwen Du, Yuxin Dong, Xia Ning · Sep 30, 2025

Large Language Models (LLMs) excel at problem solving by generating chain of thoughts in natural language, but such verbal thinking is computationally costly and prone to overthinking.
LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts
Yuan Zhuang, Yi Shen, Yuexin Bian, Qing Su, Shihao Ji · Sep 30, 2025

Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared to state-of-the-art baselines, across a diverse set of benchmarks.
Polychromic Objectives for Reinforcement Learning
Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh · Sep 29, 2025

Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks.
Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs
Shane Bergsma, Nolan Dey, Joel Hestness · Sep 29, 2025

We introduce the training re-evaluation curve (TREC), a diagnostic that retrospectively evaluates training batches using the final model weights.
Generative Value Conflicts Reveal LLM Priorities
Andy Liu, Kshitish Ghate, Mona Diab, Daniel Fried, Atoosa Kasirzadeh · Sep 29, 2025

Comparing results between multiple-choice and open-ended evaluations, we find that models shift away from supporting protective values, such as harmlessness, and toward supporting personal values, such as user autonomy, in more open-ended v
Incentive-Aligned Multi-Source LLM Summaries
Yanchen Jiang, Zhe Feng, Aranyak Mehta · Sep 29, 2025

Large language models (LLMs) are increasingly used in modern search and answer systems to synthesize multiple, sometimes conflicting, texts into a single response, yet current pipelines offer weak incentives for sources to be accurate and a
TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models
Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang · Sep 29, 2025

TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs.
Inducing Dyslexia in Vision Language Models
Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf · Sep 29, 2025

Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that they predict human VWFA neural responses.
Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian · Sep 28, 2025

Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding.
SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models
Ziyi Yang, Weizhou Shen, Chenliang Li, Ruijun Chen, Fanqi Wan · Sep 28, 2025

This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals.
Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan · Sep 28, 2025

Pairwise Preference

These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning.
Characteristic Root Analysis and Regularization for Linear Time Series Forecasting
Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li · Sep 28, 2025

Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating our theoretical insights and achieving state-of-the-art results in several settings.
mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations
Guy Dar · Sep 27, 2025

We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data.
PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space
Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Ziwei He · Sep 27, 2025

The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test-time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve the generation
RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025

Long Horizon

Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
General Exploratory Bonus for Optimistic Exploration in RLHF
Wendi Li, Changdae Oh, Sharon Li · Sep 27, 2025

Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods to incentivize exploration often fail to realize optimism.
Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents
Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai · Sep 27, 2025

To tackle these challenges, we present ReMemR1, which integrates the mechanism of memory retrieval into the memory update process, enabling the agent to selectively callback historical memories for non-linear reasoning.
HEART: Emotionally-Driven Test-Time Scaling of Language Models
Gabriela Pinto, Palash Goyal, Mihir Parmar, Yiwen Song, Souradip Chakraborty · Sep 26, 2025

We introduce HEART, a framework that uses emotional cues to guide the model's focus, much like how feelings contribute to human decision-making.
From Parameters to Behaviors: Unsupervised Compression of the Policy Space
Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli · Sep 26, 2025

Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient.
FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation
Haorui Chen, Chengze Li, Jia Li · Sep 26, 2025

However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a significant challenge.
LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning
Tiago Fernandes Tavares · Sep 26, 2025

A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture.
SciTS: Scientific Time Series Understanding and Generation with LLMs
Wen Wu, Ziyang Zhang, Liwei Liu, Xuenan Xu, Jimin Zhuang · Sep 26, 2025

To address these gaps, we introduce SciTS, a benchmark spanning 12 scientific domains and 43 tasks, with over 50k+ instances, both univariate and multivariate signals ranging from $10^0$ to $10^7$ in length and up to 10 MHz in frequency.
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis · Sep 26, 2025

Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace.
Fine-tuning Done Right in Model Editing
Wanli Yang, Rui Tang, Hongyu Zang, Du Su, Qi Cao · Sep 26, 2025

Fine-tuning, a foundational method for adapting large language models, has long been considered ineffective for model editing.
ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation
Jiho Kim, Junseong Choi, Woosog Chay, Daeun Kyung, Yeonsu Kwon · Sep 26, 2025

Pairwise Preference

In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context.
Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong · Sep 25, 2025

Rubric Rating

Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs.
UPDESH: Synthesizing Grounded Instruction Tuning Data for 13 Indic Languages
Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay · Sep 25, 2025

Comprehensive evaluation using automated metrics and 10K human assessments confirms high data quality.
EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025

Expert Verification · Multi Agent

We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and
Diversity Boosts AI-Generated Text Detection
Advik Raj Basani, Pin-Yu Chen · Sep 23, 2025

Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features.

Recent Daily Archives

fortnight-2026-f04 (335) week-2026-w08 (288) quarter-2026-q1 (732) month-2026-02 (682) fortnight-2026-f05 (323) week-2026-w09 (323) 2026-02-23 (56) 2026-02-25 (89) 2026-02-24 (118) month-2026-01 (50) 2026-02-20 (37) week-2026-w07 (47) 2026-02-26 (60) quarter-2025-q4 (124) 2026-02-18 (56) 2026-02-21 (22) 2026-02-17 (54) 2026-02-19 (60) fortnight-2026-f02 (27) 2026-02-22 (23) 2026-02-16 (36) month-2025-10 (69) quarter-2025-q2 (78) fortnight-2026-f01 (16) fortnight-2026-f03 (34) quarter-2025-q3 (90) week-2026-w06 (23) month-2025-12 (30) fortnight-2025-f22 (32) quarter-2025-q1 (35) month-2025-06 (39) month-2025-09 (44) month-2025-11 (25) week-2026-w03 (15) week-2026-w04 (12) 2026-02-13 (8) fortnight-2025-f21 (32) 2026-02-15 (7) fortnight-2025-f12 (29) week-2025-w39 (21)
