HFEPX Hub

Coding + Pairwise Preference (Last 30 Days)

Updated from current HFEPX corpus (Mar 8, 2026). 13 papers are grouped in this hub page.

Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper listed in this set is from Mar 3, 2026.

Papers: 13 · Last published: Mar 3, 2026

Coding · Pairwise Preference · Last 30d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

All Sampled Papers (13) · Replication-Ready Only (0)

High-Signal Coverage

100.0%

13 / 13 sampled papers are not flagged as low-signal.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

0 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
1 paper reports explicit quality controls (calibration/adjudication/IAA).
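
The two filters above (replication-ready, judge/human comparability) can be sketched as simple predicates. This is a minimal illustration, not the hub's actual implementation: the `Paper` record, its field names, and the `automatic_metrics` mode tag are assumptions, and the three sample rows only mirror what the protocol matrix on this page reports for the top-ranked papers.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical record for one hub entry; field names are illustrative."""
    title: str
    benchmarks: list = field(default_factory=list)
    metrics: list = field(default_factory=list)
    eval_modes: list = field(default_factory=list)


def is_replication_ready(p: Paper) -> bool:
    # Benchmark + metric + explicit evaluation mode must all be present.
    return bool(p.benchmarks and p.metrics and p.eval_modes)


def supports_judge_human_comparison(p: Paper) -> bool:
    # The paper must carry both the `human_eval` and `llm_as_judge` tags.
    return {"human_eval", "llm_as_judge"} <= set(p.eval_modes)


# Rows transcribed from the protocol matrix on this page (top three papers).
papers = [
    Paper("Step 3.5 Flash", benchmarks=["LiveCodeBench", "BrowseComp"],
          metrics=["latency", "cost"]),
    Paper("RewardUQ", metrics=["accuracy"], eval_modes=["automatic_metrics"]),
    Paper("ChartEditBench", benchmarks=["ChartEditBench"],
          eval_modes=["automatic_metrics"]),
]

ready = [p.title for p in papers if is_replication_ready(p)]
comparable = [p.title for p in papers if supports_judge_human_comparison(p)]
print("replication-ready:", ready)
print("judge-vs-human comparable:", comparable)
```

Under these assumptions none of the three rows passes either filter, because each is missing at least one of the required ingredients.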

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 53.8% of papers in this hub (7/13).
BrowseComp is the most frequently cited benchmark on this page, but it appears in only one paper (1/13), so cross-paper comparisons anchored on it are limited.

Protocol Takeaways

The most common quality-control signal is rater calibration (7.7% of papers).
Raters are mostly domain experts, and the annotation unit is commonly pairwise; use this to scope replication staffing.
Stratify by benchmark (BrowseComp vs ChartEditBench) before comparing methods.
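
Stratifying by benchmark before comparing methods can be sketched as below. The method names and scores are placeholders invented for illustration, not results from any paper in this hub; `stratify_by_benchmark` and `rank_within` are hypothetical helper names.

```python
from collections import defaultdict

# (method, benchmark, metric, value) rows.
# PLACEHOLDER values for illustration only; not taken from the hub papers.
results = [
    ("method_a", "BrowseComp", "accuracy", 0.61),
    ("method_b", "BrowseComp", "accuracy", 0.58),
    ("method_a", "ChartEditBench", "accuracy", 0.40),
    ("method_b", "ChartEditBench", "accuracy", 0.47),
]


def stratify_by_benchmark(rows):
    """Group result rows by benchmark so scores are never pooled across benchmarks."""
    groups = defaultdict(list)
    for method, benchmark, metric, value in rows:
        groups[benchmark].append((method, metric, value))
    return dict(groups)


def rank_within(groups, metric):
    """Rank methods inside each benchmark stratum, best score first."""
    ranking = {}
    for benchmark, rows in groups.items():
        scored = [(m, v) for m, name, v in rows if name == metric]
        ranking[benchmark] = sorted(scored, key=lambda mv: mv[1], reverse=True)
    return ranking


ranking = rank_within(stratify_by_benchmark(results), "accuracy")
for benchmark, ordered in ranking.items():
    print(benchmark, ordered)
```

With the placeholder numbers the two strata rank the methods in opposite order, which is the kind of disagreement a pooled comparison would hide.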

Benchmark Interpretation

BrowseComp appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.
ChartEditBench appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 15.4% of hub papers (2/13); compare with a secondary metric before ranking methods.
Cost is reported in 7.7% of hub papers (1/13); compare with a secondary metric before ranking methods.

Researcher Checklist

Strong: Papers with explicit human feedback
Coverage is strong (100% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (7.7% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (15.4% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (46.2% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (15.4% vs 35% target).

Strong: Papers with known annotation unit
Coverage is strong (38.5% vs 35% target).
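
The checklist labels can be reproduced with a small coverage-versus-target check. This is a sketch under two assumptions: the per-item paper counts are back-computed from the percentages shown on this page (13 papers total), and the rule "Strong if coverage meets or exceeds the target, otherwise Gap" is inferred from the labels rather than documented by the hub.

```python
N_PAPERS = 13

# (checklist item, papers with the signal, target share).
# Counts are inferred from the page's percentages, e.g. 6/13 = 46.2%.
checks = [
    ("explicit human feedback", 13, 0.45),
    ("quality controls", 1, 0.30),
    ("benchmarks/datasets", 2, 0.35),
    ("evaluation metrics", 6, 0.35),
    ("known rater population", 2, 0.35),
    ("known annotation unit", 5, 0.35),
]


def classify(count, total, target):
    """Return (label, coverage in percent); the >= threshold rule is an assumption."""
    coverage = count / total
    label = "Strong" if coverage >= target else "Gap"
    return label, round(100 * coverage, 1)


report = {name: classify(count, N_PAPERS, target) for name, count, target in checks}
for name, (label, pct) in report.items():
    print(f"{label}: {name} ({pct}%)")
```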

Strengths

Strong human-feedback signal (100% of papers).

Known Gaps

Only 7.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15.4% coverage).
Benchmark coverage is thin (15.4% of papers mention benchmarks/datasets).

Suggested Next Analyses

Stratify by benchmark (BrowseComp vs ChartEditBench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.
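
For the inter-annotator agreement check, one common choice for two raters labeling pairwise preferences is Cohen's kappa. The sketch below is a plain-Python version; the choice of kappa, the rater labels, and the function name are all assumptions for illustration, since the hub does not say which agreement statistic the papers use.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up pairwise judgments ("A" = first response preferred, "B" = second).
rater_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
rater_2 = ["A", "B", "B", "B", "A", "A", "A", "A"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Raw agreement here is 6/8, but kappa is lower because both raters prefer "A" most of the time, so a fair share of that agreement is expected by chance.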

Recommended Queries

Benchmark Slice: BrowseComp · Metric Slice: accuracy · Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Para…

Highest protocol score with explicit human/eval signal plus LiveCodeBench.

Strongest benchmark reference

RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models

Reported benchmark with accuracy gives a fast comparison anchor.

Strongest recent paper

ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multi…

Useful for scanning current practice; published Feb 17, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Feb 11, 2026 · Citations: 0 · Score: 8.0
HF: Pairwise Preference · Eval: Not Reported · Benchmark: LiveCodeBench · Metric: Latency

RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
Feb 27, 2026 · Citations: 0 · Score: 7.5
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Feb 17, 2026 · Citations: 0 · Score: 6.5
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: ChartEditBench · Metric: Not Reported

PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Feb 14, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Helpfulness

PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems
Mar 3, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference, Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: ROUGE

Surgical Post-Training: Cutting Errors, Keeping Knowledge
Mar 2, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters (Feb 11, 2026)	Yes: Pairwise Preference	Not Reported	LiveCodeBench, BrowseComp	Latency, Cost	Not Reported
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models (Feb 27, 2026)	Yes: Pairwise Preference	Automatic Metrics	Not Reported	Accuracy	Calibration
ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models (Feb 17, 2026)	Yes: Pairwise Preference	Automatic Metrics	ChartEditBench	Not Reported	Not Reported
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training (Feb 14, 2026)	Yes: Pairwise Preference	Automatic Metrics	Not Reported	Helpfulness	Not Reported
PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems (Mar 3, 2026)	Yes: Pairwise Preference, Expert Verification	Automatic Metrics	Not Reported	ROUGE	Not Reported
Surgical Post-Training: Cutting Errors, Keeping Knowledge (Mar 2, 2026)	Yes: Pairwise Preference	Automatic Metrics	Not Reported	Accuracy	Not Reported
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages (Feb 14, 2026)	Yes: Pairwise Preference	Automatic Metrics	Not Reported	Toxicity	Not Reported
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems (Feb 17, 2026)	Yes: Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Rethinking Metrics for Lexical Semantic Change Detection (Feb 17, 2026)	Yes: Pairwise Preference	Automatic Metrics	Not Reported	Not Reported	Not Reported
Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning (Feb 15, 2026)	Yes: Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training (Mar 2, 2026)	Yes: Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
gencat: Generative computerized adaptive testing (Feb 23, 2026)	Yes: Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Step 3.5 Flash: Open Frontier-Level Intelligence wi…	RewardUQ: A Unified Framework for Uncertainty-Aware…	ChartEditBench: Evaluating Grounded Multi-Turn Char…
Human Feedback	Pairwise Preference	Pairwise Preference	Pairwise Preference
Evaluation Modes	Not reported	Automatic Metrics	Automatic Metrics
Benchmarks	LiveCodeBench, BrowseComp	Not reported	ChartEditBench
Metrics	Latency, Cost	Accuracy	Not reported
Quality Controls	Not reported	Calibration	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Unknown	Ranking	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (13)
Expert Verification (1)

Evaluation Modes

Automatic Metrics (7)

Top Benchmarks

BrowseComp (1)
ChartEditBench (1)
IMO-AnswerBench (1)
LiveCodeBench (1)

Top Metrics

Accuracy (2)
Cost (1)
Helpfulness (1)
Latency (1)

Rater Population Mix

Domain Experts (1)
Mixed (1)

Quality Controls

Calibration (1)

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 15.4% · metrics 46.2% · quality controls 7.7%.

Top Papers

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026 · Citations: 0
Pairwise Preference · Tool Use
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.

RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie · Feb 27, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
Reward models are central to aligning large language models (LLMs) with human preferences.

PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics · Multi Agent
We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.

PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems
Sudip Bhujel · Mar 3, 2026 · Citations: 0
Pairwise Preference · Expert Verification · Automatic Metrics
Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content.

Surgical Post-Training: Cutting Errors, Keeping Knowledge
Wenye Lin, Kai Han · Mar 2, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover--and validate both theoretically and empirically--an overlooked yet critical mechanism: the implicit regularization inherent in Direct…

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0
Pairwise Preference · Multi Agent
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and…

ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages.

Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and…

Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning
Zhimin Zhao · Feb 15, 2026 · Citations: 0
Pairwise Preference
We propose a five-level hierarchy of learnability based on information structure and argue that the ceiling on ML progress depends less on model size than on whether a task is learnable at all.

EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu · Mar 2, 2026 · Citations: 0
Pairwise Preference
We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior.

gencat: Generative computerized adaptive testing
Wanyong Feng, Andrew Lan · Feb 23, 2026 · Citations: 0
Pairwise Preference
We train the model in a two-step process, first via Supervised Fine-Tuning and then via preference optimization for knowledge-response alignment.

LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation
Jizheng Chen, Weiming Zhang, Xinyi Dai, Weiwen Liu, Kounianhua Du · Feb 15, 2026 · Citations: 0
Pairwise Preference
LogitsCoder iteratively generates and refines reasoning steps by first steering token selection toward statistically preferred patterns via Logits Preference Decoding, then selecting and aggregating diverse reasoning paths using Logits Rank…
