Matched via arXiv identifier search · Partial overlap with paper title keywords
- Stars
- 1,044
- Last push
- Jun 17, 2026 (3d ago)
Risk flags
- No CI pipeline detected
- No tagged releases
- No Docker setup
Sen Xu, Shixi Liu, Wei Wang, Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou, Junlin Zhang
Core AI workload signals detected from paper context and implementation/artifact evidence.
This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and o ...
ffline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1\% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.
Some benchmark signal exists in the extracted evidence, but it is not structured strongly enough yet for a confident benchmark decision.
This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime.
WeiboAI/VibeThinker is the best available implementation candidate based on ranking signals, but recommendation confidence is not yet high. License is declared (MIT).
Open WeiboAI/VibeThinkerHardware Notes
Expect multi-day setup/compute for meaningful reproduction based on current guidance.
Evidence graph: 4 refs, 4 links.
Utility signals: depth 100/100, grounding 95/100, status high.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
Matched via arXiv identifier search · Partial overlap with paper title keywords
Risk flags
Matched via arXiv identifier search · Strong overlap with paper title keywords
Risk flags
Matched via arXiv identifier search
Risk flags
Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
Hardware requirements
No dependency manifest — manual reconstruction required
No additional verified repositories beyond the primary recommendation.
These repositories had low-confidence matching signals and are hidden by default.
No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.
No trustworthy dataset matches right now.
Search datasets on Hugging FaceNo trustworthy demo spaces right now.
Search spaces on Hugging FaceTasks
Instruction tuning
Methods
Reinforcement learning
Domains
Natural Language Processing, Large Language Models
Evaluation & Human Feedback Data
Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.
Open in HFEPXExplore Similar Papers
Jump to Paper2Code search queries derived from this paper's research context.
Need human evaluators for your AI research? Scale annotation with expert AI Trainers.