HFEPX Archive Slice

HFEPX Quarterly Archive: 2024-Q4

Updated from current HFEPX corpus (Mar 10, 2026). 28 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 10, 2026). 28 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: Biggenbench. Common metric signal: agreement. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Dec 31, 2024.

Papers: 28 Last published: Dec 31, 2024 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

28 / 28 papers are not low-signal flagged.

Benchmark Anchors

10.7%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

21.4%

Papers with reported metric mentions in extraction output.

2 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

10.7% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 14.3% of papers in this hub.
Biggenbench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (3.6% of papers).
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

LMUnit: Fine-grained Evaluation with Natural Language Unit Tests
Dec 17, 2024 · Citations: 0 · Score: 6.4

Eval: Human Eval · Metrics: Agreement
GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression
Dec 31, 2024 · Citations: 0 · Score: 2.9

Eval: Not reported · Metrics: Cost, Inference cost
Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
Dec 19, 2024 · Citations: 0 · Score: 2.9

Eval: Human Eval · Metrics: Not reported
Predicting Subway Passenger Flows under Incident Situation with Causality
Dec 9, 2024 · Citations: 0 · Score: 2.9

Eval: Automatic Metrics · Metrics: Accuracy
Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes
Nov 19, 2024 · Citations: 0 · Score: 2.9

Eval: Automatic Metrics · Metrics: F1, Recall
Renaissance: Investigating the Pretraining of Vision-Language Encoders
Nov 11, 2024 · Citations: 0 · Score: 2.9

Eval: Automatic Metrics · Metrics: Cost

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
LMUnit: Fine-grained Evaluation with Natural Language Unit Tests Dec 17, 2024	Human Eval	Biggenbench, Rewardbench	Agreement	Inter Annotator Agreement Reported
GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression Dec 31, 2024	Not reported	Not reported	Cost, Inference cost	Calibration
Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models Dec 19, 2024	Human Eval	Harmoniceval	Not reported	Not reported
Predicting Subway Passenger Flows under Incident Situation with Causality Dec 9, 2024	Automatic Metrics	Not reported	Accuracy	Not reported
Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes Nov 19, 2024	Automatic Metrics	Not reported	F1, Recall	Not reported
Renaissance: Investigating the Pretraining of Vision-Language Encoders Nov 11, 2024	Automatic Metrics	Not reported	Cost	Not reported
LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation Nov 7, 2024	Automatic Metrics	Not reported	Cost	Not reported
Diverging Preferences: When do Annotators Disagree and do Models Know? Oct 18, 2024	Llm As Judge	Not reported	Not reported	Not reported
Evaluating LLMs' Divergent Thinking Capabilities for Scientific Idea Generation with Minimal Context Dec 23, 2024	Not reported	Liveideabench	Not reported	Not reported
Efficient Context Propagating Perceiver Architectures for Auto-Regressive Language Modeling Dec 8, 2024	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (10.7% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (7.1% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (7.1% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (7.1% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (3.6% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (3.6% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 7.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (3.6% coverage).
Annotation unit is under-specified (3.6% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (Biggenbench vs KG-retrieval) before comparing methods.
Track metric sensitivity by reporting both agreement and latency.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: Biggenbench Metric Slice: agreement IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 7.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (3.6% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (4)
Human Eval (2)
Llm As Judge (1)
Simulation Env (1)

Top Metrics

Agreement (1)
Latency (1)

Top Benchmarks

Biggenbench (1)
KG Retrieval (1)
Rewardbench (1)

Quality Controls

Calibration (1)
Inter Annotator Agreement Reported (1)

Papers In This Archive Slice

GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression
Kainan Liu, Yong Zhang, Ning Cheng, Zhitao Li, Shaojun Wang · Dec 31, 2024 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Evaluating LLMs' Divergent Thinking Capabilities for Scientific Idea Generation with Minimal Context
Kai Ruan, Xuan Wang, Jixiang Hong, Peng Wang, Yang Liu · Dec 23, 2024 · Citations: 0

While Large Language Models (LLMs) demonstrate remarkable capabilities in scientific tasks such as literature analysis and experimental design (e.g., accurately extracting key findings from papers or generating coherent experimental…
A Survey of Query Optimization in Large Language Models
Mingyang Song, Mao Zheng · Dec 23, 2024 · Citations: 0

We further examine evaluation methodologies, identify critical gaps in existing benchmarks, and discuss open challenges including process reward models, efficiency optimization, and multi-modal query handling.
LLM4AD: A Platform for Algorithm Design with Large Language Model
Fei Liu, Rui Zhang, Zhuoliang Xie, Rui Sun, Kai Li · Dec 23, 2024 · Citations: 0

We have also designed a unified evaluation sandbox to ensure a secure and robust assessment of algorithms.
Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue · Dec 19, 2024 · Citations: 0

However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning.
LMUnit: Fine-grained Evaluation with Natural Language Unit Tests
Jon Saad-Falcon, Rajan Vivek, William Berrios, Nandita Shankar Naik, Matija Franklin · Dec 17, 2024 · Citations: 0

Pairwise Preference

We introduce natural language unit tests, a paradigm that decomposes response quality into explicit, testable criteria, along with a unified scoring model, LMUnit, which combines multi-objective training across preferences, direct ratings,…
Efficient Continual Learning for Small Language Models with a Discrete Key-Value Bottleneck
Andor Diera, Lukas Galke, Fabian Karl, Ansgar Scherp · Dec 11, 2024 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SpecFuse: Ensembling Large Language Models via Next-Segment Prediction
Bo Lv, Nayu Liu, Chen Tang, Xin Liu, Yue Yu · Dec 10, 2024 · Citations: 0

Experimental results on five LLM families (ranging from 7B to 72B parameters) and six benchmark datasets, spanning open-domain instruction following, reasoning, commonsense, demonstrate consistent performance improvements compared to…
Speaker effects in language comprehension: An integrative model of language and speaker processing
Hanlin Wu, Zhenguang G. Cai · Dec 10, 2024 · Citations: 0

We discuss how speaker effects serve as indices for assessing language development and social cognition, and we encourage future research to extend these findings to the emerging domain of artificial intelligence (AI) speakers, as AI agents…
Predicting Subway Passenger Flows under Incident Situation with Causality
Xiannan Huang, Shuhan Qiu, Quan Yuan, Chao Yang · Dec 9, 2024 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Efficient Context Propagating Perceiver Architectures for Auto-Regressive Language Modeling
Kaleel Mahmood, Shaoyi Huang · Dec 8, 2024 · Citations: 0

Pairwise Preference

To this end, we develop four new architectural paradigms, the best performing of which we denote as the Efficient Context propagating Perceiver (ECP).
A Contemporary Overview: Trends and Applications of Large Language Models on Mobile Devices
Lianjun Liu, Hongli An, Pengxuan Chen, Longxiang Ye · Dec 4, 2024 · Citations: 0
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su · Nov 25, 2024 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Cautious Optimizers: Improving Training with One Line of Code
Kaizhao Liang, Lizhang Chen, Bo Liu, Qiang Liu · Nov 25, 2024 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes
Rahul Garg, Trilok Padhi, Hemang Jain, Ugur Kursuncu, Ponnurangam Kumaraguru · Nov 19, 2024 · Citations: 0

Experimental results from our study on two hate speech benchmark datasets demonstrate superior performance over the state-of-the-art baselines across AU-ROC, F1, and Recall with improvements of 1.1%, 7%, and 35%, respectively.
Federated Co-tuning Framework for Large and Small Language Models
Tao Fan, Yan Kang, Guoqiang Ma, Lixin Fan, Shuoling Liu · Nov 18, 2024 · Citations: 0

Our evaluation of FedCoLLM, utilizing various public LLMs and SLMs across a range of NLP text generation tasks, reveals that the performance of clients' SLMs experiences notable improvements with the assistance of the LLMs.
Personalized Help for Optimizing Low-Skilled Users' Strategy
Feng Gu, Wichayaporn Wongkamjan, Jonathan K. Kummerfeld, Denis Peskoff, Jonathan May · Nov 14, 2024 · Citations: 0

AIs can beat humans in game environments; however, how helpful those agents are to human remains understudied.
UniHR: Hierarchical Representation Learning for Unified Knowledge Graph Link Prediction
Zhiqiang Liu, Yin Hua, Mingyang Chen, Yichi Zhang, Zhuo Chen · Nov 11, 2024 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Renaissance: Investigating the Pretraining of Vision-Language Encoders
Clayton Fields, Casey Kennington · Nov 11, 2024 · Citations: 0

To conduct these experiments, we introduce a VL evaluation framework called Renaissance.
LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang · Nov 7, 2024 · Citations: 0

The LLM-enhanced CLIP delivers consistent improvements across a wide range of downstream tasks, including linear-probe classification, zero-shot image-text retrieval with both short and long captions (in English and other languages),…
Llama-Mob: Instruction-Tuning Llama-3-8B Excels in City-Scale Mobility Prediction
Peizhi Tang, Chuang Yang, Tong Xing, Xiaohang Xu, Jiayi Xu · Oct 31, 2024 · Citations: 0

Human mobility prediction plays a critical role in applications such as disaster response, urban planning, and epidemic forecasting.
WAFFLE: Finetuning Multi-Modal Models for Automated Front-End Development
Shanchao Liang, Nan Jiang, Shangshu Qian, Lin Tan · Oct 24, 2024 · Citations: 0

Models fine-tuned with Waffle show up to 9.00 pp (percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing benchmark Design2Code,…
Scaling Knowledge Graph Construction through Synthetic Data Generation and Distillation
Prafulla Kumar Choubey, Xin Su, Man Luo, Xiangyu Peng, Caiming Xiong · Oct 22, 2024 · Citations: 0
Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation
Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton · Oct 19, 2024 · Citations: 0

Long Horizon

We set a new state-of-the-art on thematic fit benchmarks, but show that closed and open weight LLMs respond differently to our prompting strategies: Closed models achieve better scores overall and benefit from multi-step reasoning, but they…
Diverging Preferences: When do Annotators Disagree and do Models Know?
Michael JQ Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau · Oct 18, 2024 · Citations: 0

Pairwise Preference

In our experiments, we demonstrate how standard reward modeling (e.g., Bradley-Terry) and LLM-as-Judge evaluation methods fail to account for divergence between annotators.
SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs
Yuling Gu, Oyvind Tafjord, Hyunwoo Kim, Jared Moore, Ronan Le Bras · Oct 17, 2024 · Citations: 0

Yet most evaluations stop at explicit belief attribution in classical toy stories or stylized tasks, leaving open the questions of whether LLMs can implicitly apply such knowledge to predict human behavior, or to judge an observed behavior,…
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart · Oct 7, 2024 · Citations: 0

To answer these questions, we introduce a benchmark for standardizing research on two-player, sequential, language-based games.
A Watermark for Black-Box Language Models
Dara Bahri, John Wieting · Oct 2, 2024 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote