HFEPX Hub

Coding + Pairwise Preference (Last 90 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 10 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequently cited benchmark: BrowseComp. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 11, 2026.

Papers: 10 · Last published: Feb 11, 2026

Coding · Pairwise Preference · Last 90d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage: 100.0% (10 / 10 sampled papers are not low-signal flagged).

Replication-Ready Set: benchmark + metric + eval mode explicitly present.

Judge/Human Comparability: papers containing both `human_eval` and `llm_as_judge`.

0 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.
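
The two screening rules above (replication-ready, judge-vs-human comparable) can be sketched as simple predicates over per-paper protocol fields. This is a minimal illustration, not the hub's actual pipeline: the `PaperProtocol` class, its field names, and the `automatic_metrics` tag are hypothetical, and the three records are abbreviated from the top-ranked papers on this page.

```python
from dataclasses import dataclass, field


@dataclass
class PaperProtocol:
    """Abstract-level protocol fields for one paper (hypothetical schema)."""
    title: str
    eval_modes: list[str] = field(default_factory=list)
    benchmarks: list[str] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)


def is_replication_ready(p: PaperProtocol) -> bool:
    # Benchmark + metric + explicit evaluation mode must all be present.
    return bool(p.benchmarks and p.metrics and p.eval_modes)


def supports_judge_vs_human(p: PaperProtocol) -> bool:
    # Both tags are needed so judge labels can be compared with human labels.
    return {"human_eval", "llm_as_judge"} <= set(p.eval_modes)


# Values abbreviated from this page; the tag spellings are assumptions.
papers = [
    PaperProtocol("Step 3.5 Flash", [], ["LiveCodeBench", "BrowseComp"], ["latency", "cost"]),
    PaperProtocol("ChartEditBench", ["automatic_metrics"], ["ChartEditBench"], []),
    PaperProtocol("PrivAct", ["automatic_metrics"], [], ["helpfulness"]),
]

ready = [p.title for p in papers if is_replication_ready(p)]
comparable = [p.title for p in papers if supports_judge_vs_human(p)]
print("replication-ready:", ready)
print("judge-vs-human:", comparable)
```

Each of the three records misses at least one ingredient, so both lists come back empty, which is consistent with the zero counts reported above.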

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 50% of papers in this hub.
BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Where reported, raters are mostly domain experts and the most common annotation unit is pairwise; use this to scope replication staffing.
Stratify by benchmark (BrowseComp vs ChartEditBench) before comparing methods.

Benchmark Interpretation

BrowseComp appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
ChartEditBench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Cost is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.
Helpfulness is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.

Researcher Checklist

Strong: Papers with explicit human feedback. Coverage is strong (100% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets. Coverage is a replication risk (20% vs 35% target).
Strong: Papers naming evaluation metrics. Coverage is strong (40% vs 35% target).
Gap: Papers with known rater population. Coverage is a replication risk (20% vs 35% target).
Strong: Papers with known annotation unit. Coverage is strong (40% vs 35% target).
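
The Strong/Gap labels in the checklist are consistent with a simple threshold comparison of coverage against target. The sketch below assumes that rule (coverage at or above target counts as Strong); the page does not state the exact rule, and the `CHECKLIST` dictionary and `classify` function are illustrative names. The percentages are copied from the checklist.

```python
# (coverage %, target %) pairs copied from the checklist above.
CHECKLIST = {
    "explicit human feedback": (100, 45),
    "quality controls": (0, 30),
    "benchmarks/datasets": (20, 35),
    "evaluation metrics": (40, 35),
    "known rater population": (20, 35),
    "known annotation unit": (40, 35),
}


def classify(coverage: float, target: float) -> str:
    # Assumed rule: meeting or exceeding the target is "Strong", otherwise "Gap".
    return "Strong" if coverage >= target else "Gap"


labels = {name: classify(cov, tgt) for name, (cov, tgt) in CHECKLIST.items()}

for name, label in labels.items():
    cov, tgt = CHECKLIST[name]
    print(f"{label}: {name} ({cov}% vs {tgt}% target)")
```

Under this assumed rule the six labels match the ones listed in the checklist.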

Strengths

Strong human-feedback signal (100% of papers).
Agentic evaluation appears in 40% of papers.

Known Gaps

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (20% coverage).

Suggested Next Analyses

Stratify by benchmark (BrowseComp vs ChartEditBench) before comparing methods.
Track metric sensitivity by reporting both cost and helpfulness.
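
One way to carry out the stratification suggested above is to group papers by benchmark and keep each paper's reported metrics next to it, so that methods are only compared inside a stratum. This is a sketch under assumptions: `ROWS` and `stratify_by_benchmark` are hypothetical names, titles are shortened, and the values are taken from the protocol matrix on this page.

```python
from collections import defaultdict

# (paper, benchmarks, metrics), abbreviated from the protocol matrix.
ROWS = [
    ("Step 3.5 Flash", ["LiveCodeBench", "BrowseComp"], ["Latency", "Cost"]),
    ("ChartEditBench", ["ChartEditBench"], []),
    ("PrivAct", [], ["Helpfulness"]),
]


def stratify_by_benchmark(rows):
    """Map each benchmark to the (paper, metrics) pairs that report it."""
    strata = defaultdict(list)
    for paper, benchmarks, metrics in rows:
        # Papers without a named benchmark go into their own stratum.
        for benchmark in benchmarks or ["Not Reported"]:
            strata[benchmark].append((paper, metrics))
    return dict(strata)


strata = stratify_by_benchmark(ROWS)
for benchmark, members in sorted(strata.items()):
    print(benchmark, members)
```

A paper that reports several benchmarks appears in several strata, and papers with no named benchmark stay visible rather than being silently dropped.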

Recommended Queries

Benchmark Slice: BrowseComp
Metric Slice: cost
Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Feb 11, 2026 · Citations: 0 · Score: 8.0
HF: Pairwise Preference · Eval: Not Reported · Benchmark: LiveCodeBench · Metric: Latency

ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Feb 17, 2026 · Citations: 0 · Score: 6.5
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: ChartEditBench · Metric: Not Reported

PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Feb 14, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Helpfulness

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Feb 14, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Toxicity

Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Jan 24, 2026 · Citations: 0 · Score: 5.5
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Task success

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Feb 17, 2026 · Citations: 0 · Score: 4.5
HF: Pairwise Preference · Eval: Not Reported · Benchmark: Not Reported · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper (date)	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters (Feb 11, 2026)	Yes (Pairwise Preference)	Not Reported	LiveCodeBench, BrowseComp	Latency, Cost	Not Reported
ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models (Feb 17, 2026)	Yes (Pairwise Preference)	Automatic Metrics	ChartEditBench	Not Reported	Not Reported
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training (Feb 14, 2026)	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Helpfulness	Not Reported
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages (Feb 14, 2026)	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Toxicity	Not Reported
Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization (Jan 24, 2026)	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Task success	Not Reported
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems (Feb 17, 2026)	Yes (Pairwise Preference)	Not Reported	Not Reported	Not Reported	Not Reported
Rethinking Metrics for Lexical Semantic Change Detection (Feb 17, 2026)	Yes (Pairwise Preference)	Automatic Metrics	Not Reported	Not Reported	Not Reported
Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning (Feb 15, 2026)	Yes (Pairwise Preference)	Not Reported	Not Reported	Not Reported	Not Reported
gencat: Generative computerized adaptive testing (Feb 23, 2026)	Yes (Pairwise Preference)	Not Reported	Not Reported	Not Reported	Not Reported
LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation (Feb 15, 2026)	Yes (Pairwise Preference)	Not Reported	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Step 3.5 Flash: Open Frontier-Level Intelligence wi…	ChartEditBench: Evaluating Grounded Multi-Turn Char…	PrivAct: Internalizing Contextual Privacy Preservat…
Human Feedback	Pairwise Preference	Pairwise Preference	Pairwise Preference
Evaluation Modes	Not reported	Automatic Metrics	Automatic Metrics
Benchmarks	LiveCodeBench, BrowseComp	ChartEditBench	Not reported
Metrics	Latency, Cost	Not reported	Helpfulness
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Unknown	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (10)

Evaluation Modes

Automatic Metrics (5)

Top Benchmarks

BrowseComp (1)
ChartEditBench (1)
IMO-AnswerBench (1)
LiveCodeBench (1)

Top Metrics

Cost (1)
Helpfulness (1)
Latency (1)
Task success (1)

Rater Population Mix

Domain Experts (2)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 20.0% · metrics 40.0% · quality controls 0.0%.
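
The coverage diagnostics can be reproduced from the protocol matrix by counting, per field, how many papers report at least one value. The sketch below is an illustration under assumptions: `MATRIX`, `FIELDS`, and `coverage` are hypothetical names, titles are shortened, and the counts are read off the protocol matrix on this page.

```python
# One row per paper: (short title, has human feedback, #benchmarks, #metrics, #quality controls)
MATRIX = [
    ("Step 3.5 Flash", True, 2, 2, 0),
    ("ChartEditBench", True, 1, 0, 0),
    ("PrivAct", True, 0, 1, 0),
    ("Bridging the Multilingual Safety Divide", True, 0, 1, 0),
    ("Decoupling Strategy and Execution", True, 0, 1, 0),
    ("The Vision Wormhole", True, 0, 0, 0),
    ("Rethinking Metrics for LSCD", True, 0, 0, 0),
    ("Why Code, Why Now", True, 0, 0, 0),
    ("gencat", True, 0, 0, 0),
    ("LogitsCoder", True, 0, 0, 0),
]

# Column index of each diagnostic within a MATRIX row.
FIELDS = {"human-feedback": 1, "benchmarks": 2, "metrics": 3, "quality controls": 4}


def coverage(rows, index: int) -> float:
    """Percentage of papers with a truthy (non-empty) value in the given column."""
    return 100.0 * sum(1 for row in rows if row[index]) / len(rows)


diagnostics = {name: coverage(MATRIX, idx) for name, idx in FIELDS.items()}
print(" · ".join(f"{name} {pct:.1f}%" for name, pct in diagnostics.items()))
```

Coverage here is counted per paper, so a paper naming two benchmarks still contributes once to the benchmark figure.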

Top Papers

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026 · Citations: 0
Pairwise Preference · Tool Use
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.

Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics · Long Horizon
To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent.

PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics · Multi Agent
We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0
Pairwise Preference · Multi Agent
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and…

ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages.

Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and…

Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning
Zhimin Zhao · Feb 15, 2026 · Citations: 0
Pairwise Preference
We propose a five-level hierarchy of learnability based on information structure and argue that the ceiling on ML progress depends less on model size than on whether a task is learnable at all.

gencat: Generative computerized adaptive testing
Wanyong Feng, Andrew Lan · Feb 23, 2026 · Citations: 0
Pairwise Preference
We train the model in a two-step process, first via Supervised Fine-Tuning and then via preference optimization for knowledge-response alignment.

LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation
Jizheng Chen, Weiming Zhang, Xinyi Dai, Weiwen Liu, Kounianhua Du · Feb 15, 2026 · Citations: 0
Pairwise Preference
LogitsCoder iteratively generates and refines reasoning steps by first steering token selection toward statistically preferred patterns via Logits Preference Decoding, then selecting and aggregating diverse reasoning paths using Logits Rank…
