Results & Benchmarks
| Task | Model | Benchmark | Score |
|---|---|---|---|
| Reinforcement learning | LLaMA-3-8B-it | GSM-8K | 79.6 |
| Reinforcement learning | Ours (SFT baseline) | GSM-8K | 74.2 |
Hardware Requirements
- Based on current guidance, expect setup and compute to take multiple days for a meaningful reproduction.
Best Implementation
Recipes to train reward model for RLHF.
Stars: 1.5k | Forks: 108 | Last push: Apr 2025 | License: Apache-2.0
- License: ✓
- CI: –
- Deps: –
- Docker: –
- Selected RLHFlow/RLHF-Reward-Modeling as the strongest maintained implementation for new work.
- Repository activity is within the last 24 months.
- Official repository is preserved separately as historical context.
Reproduction Path
1. Start with RLHFlow/RLHF-Reward-Modeling and validate setup instructions in README.
2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
3. Log exact dependency versions and runtime environment for reproducibility.
- Time to first repro: a few days
- No CI workflows detected
- Dependency manifest is missing
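Because no dependency manifest was detected, the logging step of the reproduction path matters more than usual. One way to do it is sketched below, using only the Python standard library. The function names `capture_environment` and `write_snapshot` and the output filename `environment_snapshot.json` are hypothetical choices for illustration, not part of RLHFlow/RLHF-Reward-Modeling. The sketch records only what the Python interpreter can see; GPU driver and CUDA details would need to be captured separately.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment() -> dict:
    """Collect interpreter, OS, and installed-package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.metadata.get("Version")
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        # Sort case-insensitively so snapshots diff cleanly between runs.
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


def write_snapshot(path: str = "environment_snapshot.json") -> dict:
    """Write the environment snapshot to a JSON file and return it."""
    snapshot = capture_environment()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh, indent=2)
    return snapshot


if __name__ == "__main__":
    snap = write_snapshot()
    print(f"Recorded {len(snap['packages'])} packages for Python {snap['python']}")
```

Running this once before the baseline run and once after any environment change gives two JSON files that can be diffed to explain a shift in results.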
Additional Implementations
Official
No additional official repositories detected.
Community
- RLHFlow/Online-RLHF (Confidence: low)
A recipe for online RLHF and online iterative DPO.
Stars: 543 | Forks: 48 | Last push: Dec 2024
Hugging Face Artifacts
No direct paper-linked artifacts were found. Showing strongest curated related artifacts.
Curated Related