HFEPX Hub

CS.LG + Coding Papers

Updated from current HFEPX corpus (Mar 1, 2026). 22 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequently cited benchmark: Ad-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 15, 2026.

Papers: 22 · Last published: Feb 15, 2026

CS.LG · Coding

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%

22 / 22 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

4 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).
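The replication-ready rule above (benchmark + metric + explicit evaluation mode) can be sketched as a simple filter. The rows below are transcribed from the Protocol Matrix on this page, which covers only 12 of the 22 papers; the `ProtocolRow` type, the shortened titles, and the function name are hypothetical and are not part of any HFEPX tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolRow:
    """One Protocol Matrix row; an empty tuple stands for 'Not Reported'."""
    title: str
    eval_modes: tuple
    benchmarks: tuple
    metrics: tuple


def is_replication_ready(row):
    # Assumed reading of the rule: all three fields must be non-empty.
    return bool(row.eval_modes and row.benchmarks and row.metrics)


# Transcribed from the Protocol Matrix (Top 12); titles shortened for readability.
ROWS = [
    ProtocolRow("AD-Bench", ("Simulation Env",), ("Ad Bench",), ("Pass@1", "Pass@3")),
    ProtocolRow("SWE-Protégé", ("Automatic Metrics",),
                ("SWE Bench", "SWE Bench Verified"), ("Pass@1", "Latency")),
    ProtocolRow("Zooming without Zooming", ("Automatic Metrics",), ("Zoombench",), ("Latency",)),
    ProtocolRow("AceGRPO", ("Automatic Metrics",), ("MLE Bench",), ("Latency",)),
    ProtocolRow("Precise Attribute Intensity Control", ("Automatic Metrics",), (), ("Accuracy",)),
    ProtocolRow("Hierarchy-of-Groups Policy Optimization", ("Simulation Env",),
                ("ALFWorld", "WebShop"), ()),
    ProtocolRow("The Vision Wormhole", (), (), ()),
    ProtocolRow("Why Code, Why Now", (), (), ()),
    ProtocolRow("Learning beyond Teacher", (), (), ()),
    ProtocolRow("Inner Speech as Behavior Guides", (), (), ()),
    ProtocolRow("A Benchmark for Deep Information Synthesis", ("Automatic Metrics",), (), ("F1",)),
    ProtocolRow("GUI-Libra", ("Automatic Metrics",), (), ("Accuracy", "Task success")),
]

ready = [row.title for row in ROWS if is_replication_ready(row)]
print(ready)
```

On these 12 rows the filter selects four papers, which agrees with the replication-ready count reported here; the remaining 10 papers are not in the matrix, so this is a consistency check rather than a full reproduction.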

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Why This Matters For Eval Research

50% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 40.9% of papers in this hub.
Ad-Bench is the most frequently cited benchmark anchor on this page, although it appears in only 1 of 22 papers.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly pairwise; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Benchmark Interpretation

Ad-Bench appears in 4.5% of hub papers (1/22); use this cohort for benchmark-matched comparisons.
ALFWorld appears in 4.5% of hub papers (1/22); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 18.2% of hub papers (4/22); compare with a secondary metric before ranking methods.
Latency is reported in 13.6% of hub papers (3/22); compare with a secondary metric before ranking methods.
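The percentages in the two interpretation lists above follow directly from the stated counts over 22 papers. A minimal sketch of that arithmetic, assuming one-decimal rounding (the `share` helper is a hypothetical name):

```python
TOTAL_PAPERS = 22

# Counts as stated on this page (papers mentioning each benchmark or metric).
REPORTED_COUNTS = {
    "Ad-Bench": 1,
    "ALFWorld": 1,
    "accuracy": 4,
    "latency": 3,
}


def share(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


shares = {name: share(count, TOTAL_PAPERS) for name, count in REPORTED_COUNTS.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}% ({REPORTED_COUNTS[name]}/{TOTAL_PAPERS})")
```

With a single paper per benchmark, each benchmark cohort has size one, so these shares describe coverage rather than a basis for cross-paper comparison.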

Researcher Checklist

Strong: Papers with explicit human feedback. Coverage is strong (50% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Moderate: Papers naming benchmarks/datasets. Coverage is usable but incomplete (22.7% vs 35% target).
Strong: Papers naming evaluation metrics. Coverage is strong (45.5% vs 35% target).
Moderate: Papers with known rater population. Coverage is usable but incomplete (22.7% vs 35% target).
Strong: Papers with known annotation unit. Coverage is strong (40.9% vs 35% target).
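The checklist labels compare a coverage figure against a target. The page does not state how "Moderate" is separated from "Gap", so the sketch below assumes a hypothetical rule: "Strong" at or above target, "Gap" below half of target, "Moderate" otherwise. The `gap_ratio` parameter and the function name are illustrative assumptions, chosen only because they reproduce the six labels shown.

```python
def coverage_band(coverage_pct, target_pct, gap_ratio=0.5):
    """Classify coverage against a target.

    gap_ratio is an assumed cutoff (not stated on the page): coverage below
    gap_ratio * target is treated as a gap.
    """
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct < gap_ratio * target_pct:
        return "Gap"
    return "Moderate"


# (item, coverage %, target %) as listed in the checklist.
CHECKLIST = [
    ("explicit human feedback", 50.0, 45.0),
    ("quality controls", 0.0, 30.0),
    ("benchmarks/datasets", 22.7, 35.0),
    ("evaluation metrics", 45.5, 35.0),
    ("known rater population", 22.7, 35.0),
    ("known annotation unit", 40.9, 35.0),
]

bands = {item: coverage_band(cov, target) for item, cov, target in CHECKLIST}
for item, band in bands.items():
    print(f"{band}: {item}")
```

Any cutoff between 0% and roughly 65% of target would yield the same labels for these six items, so the exact value of `gap_ratio` should not be read as the page's actual rule.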

Strengths

Strong human-feedback signal (50% of papers).
Agentic evaluation appears in 59.1% of papers.

Known Gaps

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (22.7% coverage).

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (Ad-Bench vs ALFWorld) before comparing methods.
Track metric sensitivity by reporting both accuracy and latency.
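The stratification step suggested above can be sketched as grouping papers by benchmark before any method comparison. The benchmark assignments are transcribed from the Protocol Matrix on this page; the function names and shortened titles are hypothetical.

```python
from collections import defaultdict

# (short title, benchmarks) for matrix rows that name at least one benchmark.
PAPERS = [
    ("AD-Bench", ("Ad Bench",)),
    ("SWE-Protégé", ("SWE Bench", "SWE Bench Verified")),
    ("Zooming without Zooming", ("Zoombench",)),
    ("AceGRPO", ("MLE Bench",)),
    ("Hierarchy-of-Groups Policy Optimization", ("ALFWorld", "WebShop")),
]


def stratify_by_benchmark(papers):
    """Map each benchmark name to the list of papers that report it."""
    strata = defaultdict(list)
    for title, benchmarks in papers:
        for benchmark in benchmarks:
            strata[benchmark].append(title)
    return dict(strata)


def comparable_strata(strata, min_papers=2):
    """Keep only benchmarks shared by at least min_papers papers."""
    return {b: titles for b, titles in strata.items() if len(titles) >= min_papers}


strata = stratify_by_benchmark(PAPERS)
shared = comparable_strata(strata)
print(strata)
print(shared)
```

On the rows listed here no benchmark is shared by two papers, so `shared` comes out empty; a benchmark-matched comparison would need additional papers from outside this hub.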

Recommended Queries

Human Eval Protocols
Benchmark Slice: Ad-Bench
Metric Slice: accuracy
Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Feb 15, 2026 · Citations: 0 · Score: 8.0

HF: Expert Verification · Eval: Simulation Env · Benchmark: Ad Bench · Metric: Pass@1
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Feb 25, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: SWE Bench · Metric: Pass@1
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Feb 12, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Zoombench · Metric: Latency
AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Feb 8, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: MLE Bench · Metric: Latency
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Oct 14, 2025 · Citations: 0 · Score: 5.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Feb 26, 2026 · Citations: 0 · Score: 4.5

HF: Not reported · Eval: Simulation Env · Benchmark: ALFWorld · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents (Feb 15, 2026)	Yes (Expert Verification)	Simulation Env	Ad Bench	Pass@1, Pass@3	Not Reported
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents (Feb 25, 2026)	No (Not Reported)	Automatic Metrics	SWE Bench, SWE Bench Verified	Pass@1, Latency	Not Reported
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception (Feb 12, 2026)	No (Not Reported)	Automatic Metrics	Zoombench	Latency	Not Reported
AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering (Feb 8, 2026)	No (Not Reported)	Automatic Metrics	MLE Bench	Latency	Not Reported
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing (Oct 14, 2025)	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Accuracy	Not Reported
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks (Feb 26, 2026)	No (Not Reported)	Simulation Env	ALFWorld, WebShop	Not Reported	Not Reported
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems (Feb 17, 2026)	Yes (Pairwise Preference)	Not Reported	Not Reported	Not Reported	Not Reported
Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning (Feb 15, 2026)	Yes (Pairwise Preference)	Not Reported	Not Reported	Not Reported	Not Reported
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation (Feb 12, 2026)	Yes (Expert Verification)	Not Reported	Not Reported	Not Reported	Not Reported
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination (Feb 24, 2026)	Yes (Demonstrations)	Not Reported	Not Reported	Not Reported	Not Reported
A Benchmark for Deep Information Synthesis (Feb 24, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	F1	Not Reported
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL (Feb 25, 2026)	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy, Task success	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AD-Bench: A Real-World, Trajectory-Aware Advertisin…	SWE-Protégé: Learning to Selectively Collaborate Wi…	Zooming without Zooming: Region-to-Image Distillati…
Human Feedback	Expert Verification	Not reported	Not reported
Evaluation Modes	Simulation Env	Automatic Metrics	Automatic Metrics
Benchmarks	Ad Bench	SWE Bench, SWE Bench Verified	Zoombench
Metrics	Pass@1, Pass@3	Pass@1, Latency	Latency
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Domain Experts	Domain Experts	Unknown
Annotation Unit	Trajectory	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (5)
Demonstrations (3)
Expert Verification (2)
Rubric Rating (1)

Evaluation Modes

Automatic Metrics (9)
Simulation Env (6)
Human Eval (1)

Top Benchmarks

Ad Bench (1)
ALFWorld (1)
MLE Bench (1)
SWE Bench (1)

Top Metrics

Accuracy (4)
Latency (3)
F1 (2)
Pass@1 (2)

Rater Population Mix

Domain Experts (5)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 50.0% · benchmarks 22.7% · metrics 45.5% · quality controls 0.0%.

Top Papers

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0

Expert Verification Simulation Env Long Horizon

While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0

Demonstrations Long Horizon

Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan · Oct 27, 2025 · Citations: 0

Pairwise Preference Human Eval

Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation.
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng · Feb 26, 2026 · Citations: 0

Simulation Env Long Horizon

Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks.
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0

Pairwise Preference Multi Agent

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and…
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0

Pairwise Preference Multi Agent

Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu · Oct 14, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference.
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai · Feb 12, 2026 · Citations: 0

Automatic Metrics Tool Use

To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen · Feb 8, 2026 · Citations: 0

Automatic Metrics Long Horizon

Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons.
KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026 · Citations: 0

Automatic Metrics Simulation Env

Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation.
RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025 · Citations: 0

Automatic Metrics Long Horizon

Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning
Zhimin Zhao · Feb 15, 2026 · Citations: 0

Pairwise Preference

We propose a five-level hierarchy of learnability based on information structure and argue that the ceiling on ML progress depends less on model size than on whether a task is learnable at all.
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang · Feb 12, 2026 · Citations: 0

Expert Verification

Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term…
Toward LLM-Supported Automated Assessment of Critical Thinking Subskills
Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg · Oct 14, 2025 · Citations: 0

Rubric Rating

As the world becomes increasingly saturated with AI-generated content, disinformation, and algorithmic persuasion, critical thinking - the capacity to evaluate evidence, detect unreliable claims, and exercise independent judgment - is…
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani · Feb 18, 2026 · Citations: 0

Simulation Env Multi Agent

MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation.
World Simulation with Video Foundation Models for Physical AI
NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala · Oct 28, 2025 · Citations: 0

Simulation Env Long Horizon

These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0

Automatic Metrics Tool Use

To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.
Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Maximilian Kreutner, Marlene Lutz, Markus Strohmaier · Jun 13, 2025 · Citations: 0

Automatic Metrics Simulation Env

We evaluate whether predictions are stable in response to counterfactual arguments, different persona prompts, and generation methods.
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
Rakshit Trivedi, Kartik Sharma, David C Parkes · Feb 24, 2026 · Citations: 0

Demonstrations

Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts.
Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktäschel · Jun 23, 2025 · Citations: 0

Demonstrations

Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily 'programmed' into LLMs through PBB, with important…
