HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W52

Updated from current HFEPX corpus (Apr 17, 2026). 29 papers are grouped in this weekly page.

Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Gold Questions. Frequently cited benchmark: DROP. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Dec 28, 2025.

Papers: 29 · Last published: Dec 28, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

High-Signal Coverage

100.0%

29 / 29 papers are not low-signal flagged.

Benchmark Anchors

6.9%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

31.0%

Papers with reported metric mentions in extraction output.

1 paper reports explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.
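
The coverage figures above are consistent with simple ratios over the 29 papers in this slice. A minimal sketch of that arithmetic follows. The counts for the benchmark and metric cards are assumptions inferred by inverting the displayed percentages (the page does not list them directly), and `coverage_pct` is a hypothetical helper name:

```python
TOTAL_PAPERS = 29  # stated on this page


def coverage_pct(count: int, total: int = TOTAL_PAPERS) -> str:
    """Format count/total as a percentage with one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{100 * count / total:.1f}%"


# Counts are assumptions inferred from the percentages shown above,
# except the quality-control count, which the page states as 1 paper.
assumed_counts = {
    "High-Signal Coverage": 29,
    "Benchmark Anchors": 2,
    "Metric Anchors": 9,
    "Explicit quality controls": 1,
}

for name, count in assumed_counts.items():
    print(f"{name}: {coverage_pct(count)}")
```

With only 29 papers, a single paper moves any of these figures by about 3.4 percentage points, which is one reason to treat the slice as early signal only.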

Why This Time Slice Matters

3.4% of papers report explicit human-feedback signals, led by expert verification.
Automatic metrics appear in 31% of papers in this hub.
DROP is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

Most common quality-control signal is gold-question checks (3.4% of papers).
Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Dec 26, 2025 · Citations: 0 · Score: 7.5
Eval: Automatic Metrics · Metrics: Accuracy

Diversity or Precision? A Deep Dive into Next Token Prediction
Dec 28, 2025 · Citations: 0 · Score: 4.0
Eval: Automatic Metrics · Metrics: Precision

Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages
Dec 27, 2025 · Citations: 0 · Score: 4.0
Eval: Automatic Metrics · Metrics: Accuracy

Hallucination Detection and Evaluation of Large Language Model
Dec 27, 2025 · Citations: 0 · Score: 4.0
Eval: Automatic Metrics · Metrics: Accuracy

Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles
Dec 23, 2025 · Citations: 0 · Score: 4.0
Eval: Automatic Metrics · Metrics: Accuracy

DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Dec 23, 2025 · Citations: 0 · Score: 4.0
Eval: Automatic Metrics, Simulation Env · Metrics: Accuracy, Cost

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics	Dec 26, 2025	Automatic Metrics	DROP, BIRD	Accuracy	Gold Questions
Diversity or Precision? A Deep Dive into Next Token Prediction	Dec 28, 2025	Automatic Metrics	Not reported	Precision	Not reported
Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages	Dec 27, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Hallucination Detection and Evaluation of Large Language Model	Dec 27, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles	Dec 23, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation	Dec 23, 2025	Automatic Metrics, Simulation Env	Not reported	Accuracy, Cost	Not reported
AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent	Dec 23, 2025	Automatic Metrics	Not reported	Accuracy, Precision	Not reported
Reason2Decide: Rationale-Driven Multi-Task Learning	Dec 23, 2025	LLM as Judge, Automatic Metrics	Not reported	Accuracy, F1	Not reported
On the Existence and Behavior of Secondary Attention Sinks	Dec 22, 2025	Automatic Metrics	Not reported	Relevance	Not reported
CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation	Dec 22, 2025	Not reported	ChartQA, CycleChart-Bench	Not reported	Not reported
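
One way to act on the advice to prioritize papers with both benchmark and metric anchors is to filter the matrix rows directly. The sketch below is illustrative only: it copies four rows from the matrix above with shortened titles, and the dictionary layout and the `has_both_anchors` helper are assumptions, not part of any HFEPX export format.

```python
NOT_REPORTED = "Not reported"

# A few rows copied from the Protocol Matrix; titles shortened for brevity.
rows = [
    {"paper": "CricBench", "benchmarks": "DROP, BIRD", "metrics": "Accuracy"},
    {"paper": "Diversity or Precision?", "benchmarks": NOT_REPORTED, "metrics": "Precision"},
    {"paper": "Reason2Decide", "benchmarks": NOT_REPORTED, "metrics": "Accuracy, F1"},
    {"paper": "CycleChart", "benchmarks": "ChartQA, CycleChart-Bench", "metrics": NOT_REPORTED},
]


def has_both_anchors(row: dict) -> bool:
    """True when a row reports at least one benchmark and at least one metric."""
    return row["benchmarks"] != NOT_REPORTED and row["metrics"] != NOT_REPORTED


anchored = [row["paper"] for row in rows if has_both_anchors(row)]
print(anchored)
```

On these four rows only CricBench passes the filter, which matches the matrix: CycleChart names benchmarks but no metrics, and the other two name metrics but no benchmarks.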

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (3.4% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (3.4% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (6.9% vs 35% target).

Moderate: Papers naming evaluation metrics
Coverage is usable but incomplete (24.1% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (10.3% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (6.9% vs 35% target).
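
The checklist labels can be reproduced by comparing each coverage figure with its target. The page does not state the rule it uses, so the sketch below is an assumption: it labels an item "Moderate" when coverage reaches at least half of the target and "Gap" otherwise. The 0.5 cutoff, the "Met" label, and the function name `classify_coverage` are all hypothetical; the cutoff was chosen only because it happens to match the six labels above.

```python
def classify_coverage(coverage_pct: float, target_pct: float,
                      moderate_ratio: float = 0.5) -> str:
    """Label coverage relative to a target.

    moderate_ratio is an assumed cutoff, not a documented HFEPX rule.
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "Met"  # hypothetical label; no item on this page reaches its target
    if ratio >= moderate_ratio:
        return "Moderate"
    return "Gap"


# (item, coverage %, target %) as listed in the checklist above.
checklist = [
    ("explicit human feedback", 3.4, 45),
    ("quality controls", 3.4, 30),
    ("benchmarks/datasets", 6.9, 35),
    ("evaluation metrics", 24.1, 35),
    ("known rater population", 10.3, 35),
    ("known annotation unit", 6.9, 35),
]

for item, coverage, target in checklist:
    print(f"{classify_coverage(coverage, target)}: {item} ({coverage}% vs {target}% target)")
```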

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 3.4% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.3% coverage).
Annotation unit is under-specified (6.9% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (DROP vs BIRD) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.
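
For the agreement checks suggested above, Cohen's kappa is one common chance-corrected statistic for two raters, such as a human annotator and an LLM judge. The following is a minimal sketch rather than a protocol taken from any paper in this slice; the labels are made-up illustration data and `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up illustration data: which of two responses each rater preferred.
human = ["A", "A", "B", "B", "A", "B"]
judge = ["A", "B", "B", "B", "A", "A"]
print(round(cohens_kappa(human, judge), 3))
```

Computing kappa separately per benchmark stratum (for example DROP vs BIRD) keeps benchmark difficulty from being confounded with judge behavior.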

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: DROP
Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (9)
Human Eval (1)
LLM as Judge (1)
Simulation Env (1)

Top Metrics

Accuracy (5)
Cost (2)
Precision (2)
BERTScore (1)

Top Benchmarks

DROP (2)
BIRD (1)
CricBench (1)

Quality Controls

Gold Questions (1)

Papers In This Archive Slice

Diversity or Precision? A Deep Dive into Next Token Prediction
Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang · Dec 28, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages
Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich · Dec 27, 2025 · Citations: 0

We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages.
Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions
Katherine Elkins, Jon Chun · Dec 27, 2025 · Citations: 0

Negation-bearing syntax is the dominant failure mode, with some models endorsing actions at 80-97% rates even when asked whether agents should not act.
Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
Geometric Scaling of Bayesian Inference in LLMs
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
The Bayesian Geometry of Transformer Attention
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · Dec 27, 2025 · Citations: 0
Hallucination Detection and Evaluation of Large Language Model
Chenggong Zhang, Haopeng Wang, Hexi Meng · Dec 27, 2025 · Citations: 0

To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high…
Intrinsic-Metric Physics-Informed Neural Networks (IM-PINN) for Reaction-Diffusion Dynamics on Complex Riemannian Manifolds
Julian Evan Chrisnanto, Salsabila Rahma Alia, Nurfauzi Fadillah, Yulison Herry Chrisnanto · Dec 26, 2025 · Citations: 0

Benchmarking against the Surface Finite Element Method (SFEM) reveals superior physical rigor: the IM-PINN achieves global mass conservation error of E_{mass} \approx 0.157 versus SFEM's 0.258, acting as a thermodynamically consistent…
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0

Expert Verification

To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation
Abdullah Alabdullah, Lifeng Han, Chenghua Lin · Dec 25, 2025 · Citations: 0

Existing automatic evaluation metrics and general-purpose human evaluation frameworks struggle to capture dialect-specific MT errors, hindering progress in translation assessment.
Measuring all the noises of LLM Evals
Sida Wang · Dec 24, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Parallel Token Prediction for Language Models
Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh · Dec 24, 2025 · Citations: 0
Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation
Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun · Dec 24, 2025 · Citations: 0
Semantic Refinement with LLMs for Graph Representations
Safal Thapaliya, Zehong Wang, Jiazheng Li, Ziming Li, Yanfang Ye · Dec 24, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation
Tomoaki Yamaguchi, Yutong Zhou, Masahiro Ryo, Keisuke Katsura · Dec 24, 2025 · Citations: 0
Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation
Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen · Dec 24, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles
Ramatu Oiza Abdulsalam, Segun Aroyehun · Dec 23, 2025 · Citations: 0

Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice.
DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull · Dec 23, 2025 · Citations: 0

Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.
Generalization of RLVR Using Causal Reasoning as a Testbed
Brian Lu, Hongyu Zhao, Shuo Sun, Hao Peng, Rui Ding · Dec 23, 2025 · Citations: 0
AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent
Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng · Dec 23, 2025 · Citations: 0

In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems.
Coherence in the brain unfolds across separable temporal regimes
Davide Staub, Finn Rabe, Akhil Misra, Yves Pauli, Roya Hüppi · Dec 23, 2025 · Citations: 0
Reason2Decide: Rationale-Driven Multi-Task Learning
H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel · Dec 23, 2025 · Citations: 0

Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge).
Geometric Organization of Cognitive States in Transformer Embedding Spaces
Sophie Zhao · Dec 23, 2025 · Citations: 0
Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?
Zhe Yin, Xiaodong Gu, Beijun Shen · Dec 23, 2025 · Citations: 0
Machine Unlearning in the Era of Quantum Machine Learning: An Empirical Study
Carla Crivoi, Radu Tudor Ionescu · Dec 22, 2025 · Citations: 0
CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation
Dazhen Deng, Sen Yang, Yuchen He, Yuan Tian, Yingcai Wu · Dec 22, 2025 · Citations: 0

To support this framework, we construct CycleChart-Bench, a lifecycle-aligned benchmark where every chart sample carries aligned annotations for generation, schema parsing, data parsing, and question answering.
On the Existence and Behavior of Secondary Attention Sinks
Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu · Dec 22, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?
Amar Lakel · Dec 22, 2025 · Citations: 0

This paper proposes an epistemological shift in the analysis of large generative models, replacing the category "Large Language Models" (LLM) with that of "Large Discourse Models" (LDM), and then with that of Artificial Discursive Agent…
Training-Free Global Geometric Association for 4D LiDAR Panoptic Segmentation
Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin · Dec 22, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
