
Less is More: Improving LLM Alignment via Preference Data Selection

Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He · Feb 20, 2025 · Citations: 0

Abstract

Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, by using just 10% of the Ultrafeedback dataset, our approach achieves 3% to 8% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach seamlessly extends to iterative DPO, yielding a roughly 3% improvement with 25% online data, revealing the high redundancy in this presumed high-quality data construction manner. These results highlight the potential of data selection strategies for advancing preference optimization.
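To make the pipeline concrete, below is a minimal sketch of the two computable pieces the abstract names: a per-pair margin from each reward source, and a Bayesian aggregation of those margins into a single preference probability, followed by top-fraction selection. This is one plausible reading, assuming Bradley-Terry style margins and conditionally independent sources (so Bayesian pooling with a uniform prior reduces to a sigmoid over summed logits); the function names, the keep_ratio parameter, and the exact aggregation rule are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def aggregate_preference_prob(margins: np.ndarray) -> np.ndarray:
    """Fuse per-source margins into one preference probability.

    Assumption: each source s emits a Bradley-Terry margin m_s for a
    (chosen, rejected) pair, i.e. p_s(chosen > rejected) = sigmoid(m_s).
    With a uniform prior and conditionally independent sources, Bayesian
    pooling of these probabilities reduces to a sigmoid of the summed logits.
    """
    pooled_logit = margins.sum(axis=1)          # margins: (n_pairs, n_sources)
    return 1.0 / (1.0 + np.exp(-pooled_logit))  # aggregated p(chosen > rejected)

def select_by_margin(margins: np.ndarray, keep_ratio: float = 0.10) -> np.ndarray:
    """Margin-maximization selection: keep the top keep_ratio fraction of
    pairs by aggregated preference probability (0.10 mirrors the 10% of
    Ultrafeedback used in the abstract)."""
    probs = aggregate_preference_prob(margins)
    n_keep = max(1, int(len(probs) * keep_ratio))
    return np.argsort(-probs)[:n_keep]          # indices of highest-margin pairs

# Toy usage: 6 pairs scored by two external reward models plus one
# implicit (DPO policy) margin source; all values are synthetic.
rng = np.random.default_rng(0)
toy_margins = rng.normal(size=(6, 3))
print(select_by_margin(toy_margins, keep_ratio=0.5))
```

One design note: under this pooling rule a pair is kept only when the external reward models and the implicit margin jointly assign it a confident label, which is how the abstract's margin-maximization principle filters out noisy preference pairs; the authors' actual weighting of the sources may differ.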

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

50/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Weak / implicit signal

HFEPX Fit

High-confidence candidate

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Pairwise Preference
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: None extracted
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.55
  • Flags: ambiguous

Protocol And Measurement Signals

Benchmarks / Datasets

AlpacaEval 2.0

Reported Metrics

No metric terms were extracted from the available abstract.

Research Brief

Deterministic synthesis

Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. HFEPX signals include Pairwise Preference with confidence 0.55. Updated from the current HFEPX corpus.

Generated Mar 4, 2026, 7:23 AM · Grounded in abstract + metadata only

Key Takeaways

  • Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences.
  • To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability.

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Cross-check benchmark overlap: AlpacaEval 2.0.
  • Verify metric definitions before comparing against your eval pipeline.

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences.
  • To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability.
  • Remarkably, by using just 10% of the Ultrafeedback dataset, our approach achieves 3% to 8% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark.

Why It Matters For Eval

  • The method curates pairwise human-preference data via reward-model margins, so its selection criteria bear directly on how feedback labels are filtered before preference optimization.
  • Reported gains are measured on AlpacaEval 2.0 across Llama, Mistral, and Qwen models, making benchmark overlap the main point of contact with an eval pipeline.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Pairwise Preference

  • Gap: Evaluation mode is explicit

    No clear evaluation mode extracted.

  • Gap: Quality control reporting is present

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Detected: AlpacaEval 2.0

  • Gap: Metric reporting is present

    No metric terms extracted.

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.
