OpenTrain AI
Maintained implementation available · pytorch · Pretrained Models Available

MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation

December 1, 2023 · arXiv: 2312.17080
3 repos · 52 stars · ~a few days to reproduce

Abstract

Results & Benchmarks

Benchmark data is not yet available for this paper.

Hardware Requirements

  • Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Best Implementation

Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs

Stars: 52 · Forks: 2 · Last push: Jul 2024 · License: MIT
  • Selected dvlab-research/diaggsm8k as the strongest maintained implementation for new work.
  • Repository activity is within the last 24 months.
  • Official repository is preserved separately as historical context.

Reproduction Path

  1. Start with dvlab-research/diaggsm8k and validate the setup instructions in its README.

  2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.

  3. Log exact dependency versions and the runtime environment for reproducibility.
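
The logging step above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not part of the dvlab-research/diaggsm8k repository: the function names `snapshot_environment` and `write_snapshot` and the output file name `environment_snapshot.json` are assumptions chosen for this example. It records the interpreter version, the platform string, and the version of every installed distribution.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Collect interpreter, OS, and installed-package versions (hypothetical helper)."""
    packages = {}
    for dist in metadata.distributions():
        # Skip distributions whose metadata is missing or has no Name field.
        name = dist.metadata.get("Name") if dist.metadata else None
        if name:
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        # Sort case-insensitively so snapshots diff cleanly between runs.
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


def write_snapshot(path="environment_snapshot.json"):
    """Write the snapshot to a JSON file (file name is an assumption) and return it."""
    snap = snapshot_environment()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snap, fh, indent=2)
    return snap


if __name__ == "__main__":
    info = write_snapshot()
    print(f"Recorded {len(info['packages'])} packages for Python {info['python']}")
```

Saving one snapshot next to each run's results makes it possible to diff two environments when a baseline number fails to match. GPU driver and CUDA versions are not captured here and would need to be recorded separately.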

Time to first repro: a few days · No CI workflows detected · Dependency manifest is missing

Additional Implementations

Official

No additional official repositories detected.

Community

  • Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs

    Stars: 52 · Forks: 2 · Last push: Jul 2024 · License: MIT

Hugging Face Artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts.