Matched via arXiv identifier search
- Stars: 0
- Last push: May 5, 2026 (44d ago)
Risk flags
- No CI pipeline detected
- No tagged releases
- No Docker setup
Shiping Gao, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Lifu Huang
No strong AI-core implementation/artifact signals were detected from current providers.
Process reward models (PRMs) provide fine-grained supervision for reasoning, but reliable PRMs often require step annotations or heavy verification pipelines, making them costly to scale and refresh during online RL. Implicit PRMs reduce this cost by training log-likelihood-ratio rewards from trajectory-level outcome labels. However, the log-ratio is constrained only as a sequence-level aggregate during training, while inference decomposes it into token- or step-level scores for partial prefixes. This train-inference mismatch leaves local credits weakly identified, so distribution-wide scoring can amplify misleading advantages. We propose Implicit Prefix-Value Reward Model (IPVRM), which directly learns the probability of eventual correctness for each prefix from outcome labels. Step signals are then obtained as temporal-difference (TD) differences between consecutive prefix values, aligning the training target with inference-time use. IPVRM markedly improves step-verification F1 on ProcessBench. To exploit these prefix values during policy optimization, we further introduce Distribution-Level RL (DistRL), which applies TD advantages to both sampled tokens and high-probability candidate tokens, providing dense counterfactual updates without additional rollouts. Experiments show that DistRL brings limited gains with unreliable implicit rewards, but consistently improves downstream reasoning when paired with IPVRM. The implementation of our method is available at https://github.com/gaoshiping/IPVRM.
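The abstract describes step signals as TD differences between consecutive prefix values. A minimal sketch of that one idea follows. It is not the authors' implementation: the function name `td_step_rewards` and the example values are hypothetical, and it assumes undiscounted differences with the first value belonging to the prompt-only prefix.

```python
from typing import List


def td_step_rewards(prefix_values: List[float]) -> List[float]:
    """Return per-step signals as differences of consecutive prefix values.

    prefix_values[t] is assumed to be a model's estimate of the probability
    that the solution ends up correct given the first t steps, so
    prefix_values[0] is the estimate for the prompt alone.
    """
    if len(prefix_values) < 2:
        return []  # no completed step, so no step signal
    # Step t's signal: V(prefix after step t) - V(prefix before step t).
    return [curr - prev for prev, curr in zip(prefix_values, prefix_values[1:])]


# Hypothetical prefix values for a three-step solution (illustrative only).
values = [0.50, 0.70, 0.65, 0.90]
rewards = td_step_rewards(values)

# A step that lowers the estimated chance of success gets a negative signal.
for step, reward in enumerate(rewards, start=1):
    print(f"step {step}: {reward:+.2f}")
```

Under these assumptions the step signals telescope: their sum equals the last prefix value minus the first, so the per-step credits stay consistent with the change in estimated correctness over the whole trajectory.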
Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.
| Task | Dataset | Metric | Value | Source | Evidence refs |
|---|---|---|---|---|---|
| Prefix-value Learning Distribution-level Optimization | MATH | MATH-500 | 46.0 | paper-derived | No explicit refs |
| Prefix-value Learning Distribution-level Optimization | Qwen3-0.6B base | GSM8K | 50.8 | paper-derived | No explicit refs |
| Prefix-value Learning Distribution-level Optimization | DPO-RM | GSM8K | 32.7 | paper-derived | No explicit refs |
| Prefix-value Learning Distribution-level Optimization | IPVRM | GSM8K | 41.9 | paper-derived | No explicit refs |
This is primarily a method paper. Reproduce it within a maintained framework baseline instead of chasing paper-specific repos.
Evidence graph: 2 refs, 1 link.
Utility signals: depth 90/100, grounding 68/100, status medium.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.
No verified implementation available
No additional verified repositories beyond the primary recommendation.
These repositories had low-confidence matching signals and are hidden by default.
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches derived from the paper title and method context:
Datasets
Tip: start with models, then check datasets/spaces if you need evaluation data or demos.
Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.
Tasks
Prefix-value Learning Distribution-level Optimization
Methods
None detected
Domains
None detected