HFEPX Archive Slice

HFEPX Weekly Archive: 2026-W04

Updated from current HFEPX corpus (Apr 12, 2026). 40 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 40 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: GSM8K. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Jan 25, 2026.

Papers: 40 Last published: Jan 25, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

40 / 40 papers are not low-signal flagged.

Benchmark Anchors

20.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

30.0%

Papers with reported metric mentions in extraction output.

3 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

12.5% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 25% of papers in this hub.
GSM8K is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (5% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Jan 21, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum
Jan 20, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: F1, F1 macro
Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring
Jan 20, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Latency
ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation
Jan 19, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Jan 24, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Task success
IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR
Jan 23, 2026 · Citations: 0 · Score: 5.5

Eval: Human Eval · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models Jan 21, 2026	Automatic Metrics	GSM8K	Accuracy	Not reported
Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum Jan 20, 2026	Automatic Metrics	Valueeval	F1, F1 macro	Not reported
Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring Jan 20, 2026	Automatic Metrics	DocVQA	Accuracy, Latency	Not reported
ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation Jan 19, 2026	Automatic Metrics	DROP	Accuracy	Not reported
Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization Jan 24, 2026	Automatic Metrics	Not reported	Task success	Not reported
IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR Jan 23, 2026	Human Eval	Writingbench	Not reported	Not reported
Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis Jan 23, 2026	Automatic Metrics	Not reported	Agreement, Cost	Adjudication, Inter Annotator Agreement Reported
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind Jan 22, 2026	Human Eval	Rebuttalbench	Not reported	Not reported
APEX-Agents Jan 20, 2026	Automatic Metrics	Not reported	Pass@1	Not reported
Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis Jan 21, 2026	Automatic Metrics	Not reported	Bleu	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (12.5% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (7.5% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (12.5% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (17.5% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (12.5% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (10% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 7.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (12.5% coverage).
Annotation unit is under-specified (10% coverage).

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (GSM8K vs Lawbench) before comparing methods.
Track metric sensitivity by reporting both accuracy and coherence.

Recommended Queries

Human Eval Protocols Benchmark Slice: GSM8K Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 7.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (12.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (10)
Human Eval (3)
Simulation Env (1)

Top Metrics

Accuracy (4)
Coherence (1)
Jailbreak success rate (1)
Pass@1 (1)

Top Benchmarks

GSM8K (1)
Lawbench (1)
Rebuttalbench (1)
SummEval (1)

Quality Controls

Calibration (2)
Adjudication (1)
Inter Annotator Agreement Reported (1)

Papers In This Archive Slice

EFT-CoT: A Multi-Agent Chain-of-Thought Framework for Emotion-Focused Therapy
Lanqing Du, Yunong Li, YuJie Long, Shihong Chen · Jan 25, 2026 · Citations: 0

Multi Agent

To address this gap, we propose EFT-CoT, a multi-agent chain-of-thought framework grounded in Emotion-Focused Therapy (EFT).
Generation-Step-Aware Framework for Cross-Modal Representation and Control in Multilingual Speech-Text Models
Toshiki Nakai, Varsha Suresh, Vera Demberg · Jan 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026 · Citations: 0

Pairwise Preference Long Horizon

To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent.
Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints
Ha Na Cho, Sairam Sutari, Alexander Lopez, Hansen Bow, Kai Zheng · Jan 24, 2026 · Citations: 0

Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety.
IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR
Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun · Jan 23, 2026 · Citations: 0

Pairwise PreferenceExpert Verification

Peer review relies on substantive, evidence-based questions, yet current LLMs generate surface-level queries that perform worse than human reviewer questions in expert evaluation.
Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis
Gaurav Negi, MA Waskow, John McCrae, Paul Buitelaar · Jan 23, 2026 · Citations: 0

Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications.
The Mouth is Not the Brain: Bridging Energy-Based World Models and Language Generation
Junichiro Niimi · Jan 23, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Jacobian Scopes: token-level causal attributions in LLMs
Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Christopher J. Earls · Jan 23, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PhysE-Inv: A Physics-Encoded Inverse Modeling approach for Arctic Snow Depth Prediction
Akila Sampath, Vandana Janeja, Jianwu Wang · Jan 23, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments
Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle · Jan 22, 2026 · Citations: 0

Pairwise Preference

To this end, we construct a new dataset by leveraging human preferences for importance implicit in football game highlight reels, without any additional annotation costs.
A Longitudinal, Multinational, and Multilingual Corpus of News Coverage of the Russo-Ukrainian War
Dikshya Mohanty, Taisiia Sabadyn, Jelwin Rodrigues, Chenlu Wang, Abhishek Kalugade · Jan 22, 2026 · Citations: 0

The corpus features comprehensive metadata and human-evaluated annotations for stance, sentiment, and topical framing, enabling systematic analysis of competing geopolitical narratives.
Computer Environments Elicit General Agentic Intelligence in LLMs
Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen · Jan 22, 2026 · Citations: 0

Agentic intelligence in large language models (LLMs) requires not only model intrinsic capabilities but also interactions with external environments.
Between Search and Platform: ChatGPT Under the DSA
Toni Lorente, Kathrin Gardhouse · Jan 22, 2026 · Citations: 0

Web Browsing

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models
Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen · Jan 22, 2026 · Citations: 0

Large Language Models (LLM) benchmarks tell us when models fail, but not why they fail.
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Zhitao He, Zongwei Lyu, Yi R Fung · Jan 22, 2026 · Citations: 0

Pairwise PreferenceCritique Edit

In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion…
What Patients Really Ask: Exploring the Effect of False Assumptions in Patient Information Seeking
Raymond Xiong, Furong Jia, Lionel Wong, Monica Agrawal · Jan 22, 2026 · Citations: 0

However, benchmarking efforts in LLMs for question answering often focus on medical exam questions, which differ significantly in style and content from the questions patients actually raise in real life.
The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao · Jan 21, 2026 · Citations: 0

Long Horizon

We demonstrate that effective reasoning can be better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead.
Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning
Yuval Kansal, Niraj K. Jha · Jan 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models
Injin Kong, Hyoungjoon Lee, Yohan Jo · Jan 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks
Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen · Jan 21, 2026 · Citations: 0

Multi Agent

While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver.
Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis
James Brock, Ce Zhang, Nantheera Anantrasirichai · Jan 21, 2026 · Citations: 0

This paper introduces Forest-Chat, an LLM-driven agent for forest change analysis, enabling natural language querying across multiple RSICI tasks, including change detection and captioning, object counting, deforestation characterisation,…
From Toil to Thought: Designing for Strategic Exploration and Responsible AI in Systematic Literature Reviews
Runlong Ye, Naaz Sibia, Angela Zavaleta Bernuy, Tingting Zhu, Carolina Nobre · Jan 21, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Compounding Disadvantage: Auditing Intersectional Bias in LLM-Generated Explanations Across Indian and American STEM Education
Amogh Gupta, Niharika Patil, Sourojit Ghosh, SnehalKumar, S Gaikwad · Jan 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration
Saeed Khaki, Ashudeep Singh, Nima Safaei, Kamal Ginotra · Jan 20, 2026 · Citations: 0

First, we introduce VisTIRA (Vision and Tool-Integrated Reasoning Agent), a tool-integrated reasoning framework that enables structured problem solving by iteratively decomposing a given math problem (as an image) into natural language…
APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0

Rubric RatingExpert Verification Long Horizon

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum
Víctor Yeste, Paolo Rosso · Jan 20, 2026 · Citations: 0

We study sentence-level detection of the 19 human values in the refined Schwartz continuum in about 74k English sentences from news and political manifestos (ValueEval'24 corpus).
Agentic SPARQL: Evaluating SPARQL-MCP-powered Intelligent Agents on the Federated KGQA Benchmark
Daniel Dobriy, Frederik Bauer, Amr Azzam, Debayan Banerjee, Axel Polleres · Jan 20, 2026 · Citations: 0
Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring
Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang · Jan 20, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis
Yushen Chen, Junzhe Liu, Yujie Tu, Zhikang Niu, Yuzhe Liang · Jan 20, 2026 · Citations: 0

Long Horizon

Key barriers include substantial cross-dialect lexical and phonological divergence, scarce synthesis-grade data, and the absence of a standardized multi-dialect evaluation benchmark.
Yuan3.0 Ultra: A Trillion-Parameter Enterprise-Oriented MoE LLM
YuanLab. ai, :, Shawn Wu, Jiangang Luo, Darcy Chen · Jan 20, 2026 · Citations: 0
Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
Xinlei Yin, Xiulian Peng, Xiao Li, Zhiwei Xiong, Yan Lu · Jan 20, 2026 · Citations: 0
Vulnerability of LLMs' Stated Beliefs? LLMs Belief Resistance Check Through Strategic Persuasive Conversation Interventions
Fan Huang, Haewoon Kwak, Jisun An · Jan 20, 2026 · Citations: 0

We present a systematic evaluation of LLM susceptibility to persuasion under the Source--Message--Channel--Receiver (SMCR) communication framework.
RAGExplorer: A Visual Analytics System for the Comparative Diagnosis of RAG Systems
Haoyu Tian, Yingchaojie Feng, Zhen Wen, Haoxuan Li, Minfeng Zhu · Jan 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation
Jesus-German Ortiz-Barajas, Jonathan Tonglet, Vivek Gupta, Iryna Gurevych · Jan 19, 2026 · Citations: 0

Preliminary human results (limited sample size) indicate a 20.2-point accuracy drop.
A Component-Based Survey of Interactions between Large Language Models and Multi-Armed Bandits
Siguang Chen, Chunli Lv, Miao Xie · Jan 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SciCoQA: Quality Assurance for Scientific Paper--Code Alignment
Tim Baumgärtner, Iryna Gurevych · Jan 19, 2026 · Citations: 0

Our evaluation of 22 LLMs demonstrates the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus.
YOLO26: An Analysis of NMS-Free End to End Framework for Real-Time Object Detection
Sudip Chakrabarty · Jan 19, 2026 · Citations: 0

To contextualize its performance, this article reviews exhaustive benchmark data from the COCO val2017 leaderboard.
When LLMs Imagine People: A Human-Centered Persona Brainstorm Audit for Bias and Fairness in Creative Applications
Hongliu Cao, Eoin Thomas, Rodrigo Acuna Agost · Jan 19, 2026 · Citations: 0

Existing methods rely on constrained tasks and fixed benchmarks, leaving open-ended creative outputs unexamined.
Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu · Jan 19, 2026 · Citations: 0

Multi Agent

Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
Empowering All-in-Loop Health Management of Spacecraft Power System in the Mega-Constellation Era via Human-AI Collaboration
Yi Di, Zhibin Zhao, Fujin Wang, Xue Liu, Jiafeng Tang · Jan 19, 2026 · Citations: 0

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now