PiKa: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi, Hongzhi Li, Yutao Xie · Oct 8, 2025 · Citations: 0
Abstract
High-quality instruction data is critical for LLM alignment, yet existing open-source datasets are often data-inefficient, requiring hundreds of thousands of examples to approach proprietary performance. In this work, we find that beyond the widely recognized importance of prompt-response quality, prompt difficulty itself plays a critical role in driving alignment gains. Motivated by this observation, we introduce PiKa, a data-efficient family of expert-level alignment datasets that concentrates supervision on high-difficulty instructions. The PiKa-SFT dataset contains only 30k examples, an order of magnitude fewer than state-of-the-art open datasets such as Magpie-Pro. Despite this small size, a Llama-3-8B-Base model fine-tuned on PiKa-SFT outperforms the official Llama-3-8B-Instruct model, which was trained on over 10M proprietary examples, on widely used benchmarks such as AlpacaEval 2.0 and Arena-Hard. We also validate the generalizability of PiKa across the Qwen2.5 series (0.5B-7B), consistently surpassing the official instruction-tuned counterparts. Additionally, we provide 30k high-quality preference-optimization examples to further enhance alignment. Our results demonstrate that strong alignment is achievable with significantly less data, democratizing access for resource-constrained research. Our code and data will be available at https://github.com/SJY8460/PiKa.
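The central idea of the abstract, concentrating supervision on high-difficulty instructions, can be sketched as a simple ranking-and-selection step over a candidate prompt pool. This is a minimal illustration, not the paper's actual pipeline: the `difficulty` scores and `select_hard_examples` helper are hypothetical placeholders, since the abstract does not specify how difficulty is measured.

```python
# Hypothetical sketch: build a small, hard SFT subset by ranking a
# candidate pool on a (placeholder) difficulty score and keeping the top-k.
# In PiKa this would select ~30k examples from a much larger pool.

def select_hard_examples(examples, top_k):
    """Return the top_k examples ranked by difficulty, hardest first."""
    ranked = sorted(examples, key=lambda ex: ex["difficulty"], reverse=True)
    return ranked[:top_k]

# Toy candidate pool with made-up difficulty scores.
pool = [
    {"prompt": "Define a linked list.", "difficulty": 0.2},
    {"prompt": "Prove the halting problem is undecidable.", "difficulty": 0.9},
    {"prompt": "Design a distributed rate limiter.", "difficulty": 0.8},
    {"prompt": "Reverse a string in Python.", "difficulty": 0.1},
]

sft_set = select_hard_examples(pool, top_k=2)
print([ex["prompt"] for ex in sft_set])
```

The selected subset would then feed a standard supervised fine-tuning run; the paper's claim is that this difficulty-focused subset matches or beats much larger, unfiltered datasets.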