implementation starting point
Benchmarks: thin evidence
Time to repro: a few hours
paddle

Results & Benchmarks

Freshness tier: cold
Direct + Inferred Evidence

Benchmark evidence drill-down

5 findings

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task                                                      Dataset    Metric               Value   Source         Evidence refs
Open-source Solution Precise Document Content Extraction  Pix2tex    BLEU                 0.4080  paper-derived  No explicit refs
Open-source Solution Precise Document Content Extraction  Texify     BLEU                 0.5890  paper-derived  No explicit refs
Open-source Solution Precise Document Content Extraction  Mathpix    BLEU                 0.8067  paper-derived  No explicit refs
Open-source Solution Precise Document Content Extraction  DocXchain  Academic Papers Val  52.8    paper-derived  No explicit refs
Open-source Solution Precise Document Content Extraction  Surya      Academic Papers Val  24.2    paper-derived  No explicit refs
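
The five findings use two different metrics (BLEU on a 0 to 1 scale, and "Academic Papers Val" on what looks like a different scale), so they should be compared only within a metric. A minimal sketch of that grouping follows. It treats the "Dataset" column as the name of the evaluated system, which is an assumption, and the descending sort is only an ordering, not a claim about how "Academic Papers Val" should be interpreted.

```python
from collections import defaultdict

# Findings copied from the table above as (system, metric, value).
# Reading the "Dataset" column as a system name is an assumption.
FINDINGS = [
    ("Pix2tex", "BLEU", 0.4080),
    ("Texify", "BLEU", 0.5890),
    ("Mathpix", "BLEU", 0.8067),
    ("DocXchain", "Academic Papers Val", 52.8),
    ("Surya", "Academic Papers Val", 24.2),
]


def rank_by_metric(findings):
    """Group findings by metric and sort each group from highest to lowest value."""
    groups = defaultdict(list)
    for system, metric, value in findings:
        groups[metric].append((system, value))
    return {
        metric: sorted(rows, key=lambda row: row[1], reverse=True)
        for metric, rows in groups.items()
    }


if __name__ == "__main__":
    for metric, rows in rank_by_metric(FINDINGS).items():
        print(metric)
        for system, value in rows:
            print(f"  {system}: {value}")
```

Keeping the metrics in separate groups avoids accidentally ranking a BLEU score against a value on another scale.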

The paper's primary contribution is MinerU, an open-source solution for precise document content extraction.

Use This Implementation Because…

Confidence: high

opendatalab/mineru is the strongest maintained implementation based on ranking signals. CI workflows are present. A license is detected, but its SPDX identifier is NOASSERTION, so confirm the actual terms in the repository before reuse.

Open opendatalab/mineru

Reproduction Risks

  • No repository-level red flags were detected, but paper-specific preprocessing and hyperparameter details may still be under-specified.

Evidence disclosure

Evidence graph: 4 refs, 4 links.

Utility signals: depth 90/100, grounding 95/100, status high.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

opendatalab/mineru
best maintained
Maintenance: Active
Confidence: High
Reproducibility: Strong

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars
67,882
Last push
Jun 17, 2026 (1d ago)
CI · Releases · Dependencies

Risk flags

  • No Docker setup

opendatalab/PDF-Extract-Kit
historical official
Maintenance: Stale
Confidence: High
Reproducibility: Moderate

Official implementation from Papers with Code · Matched via arXiv identifier search

Stars
9,729
Last push
Jan 3, 2025 (531d ago)
Releases · Dependencies

Risk flags

  • No push in 12+ months
  • No CI pipeline detected
  • No Docker setup

opendatalab/MinerU
alternative
Maintenance: Active
Confidence: Medium
Reproducibility: Strong

Matched via arXiv identifier search · Partial overlap with paper title keywords

Stars
67,882
Last push
Jun 17, 2026 (1d ago)
CI · Releases · Dependencies

Risk flags

  • No Docker setup
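
The "Active" and "Stale" maintenance labels above appear to follow from the time since the last push. The sketch below reproduces the listed day counts under two assumptions: the page was generated on Jun 18, 2026 (inferred from "1d ago"), and "12+ months" is approximated as more than 365 days. The threshold name is hypothetical, and the page's actual ranking logic is not published here.

```python
from datetime import date

# Last-push dates as listed above.
LAST_PUSH = {
    "opendatalab/mineru": date(2026, 6, 17),
    "opendatalab/PDF-Extract-Kit": date(2025, 1, 3),
}

# Assumption: inferred from "Jun 17, 2026 (1d ago)".
CHECKED_ON = date(2026, 6, 18)

# Hypothetical threshold standing in for the "No push in 12+ months" flag.
STALE_AFTER_DAYS = 365


def days_since_push(last_push, today=CHECKED_ON):
    """Number of whole days between the last push and the check date."""
    return (today - last_push).days


def maintenance_label(last_push, today=CHECKED_ON):
    """Label a repository 'Stale' once the push gap exceeds the threshold."""
    if days_since_push(last_push, today) > STALE_AFTER_DAYS:
        return "Stale"
    return "Active"


if __name__ == "__main__":
    for repo, pushed in LAST_PUSH.items():
        print(repo, days_since_push(pushed), maintenance_label(pushed))
```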

Best implementation now

opendatalab/mineru
Confidence: High
Reproducibility: Strong

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

Stars: 67,882
Forks: 5,714
Last push: Jun 17, 2026
License: NOASSERTION
Official implementation from Papers with Code
Repository link is mentioned in the paper metadata
Partial overlap with paper title keywords
Community adoption signal (67,882 stars)
License ✓
CI ✓
Deps ✓
Docker –
  • Selected opendatalab/mineru as the strongest maintained implementation for new work.
  • Includes CI workflow signals.
  • Includes dependency/environment manifest signals.
  • Repository activity is within the last 24 months.
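
The License, CI, Deps, and Docker signals can be spot-checked on a local checkout before committing to a baseline. The following is a rough sketch only: the `repo_signals` helper and its file-name heuristics are assumptions made for illustration, not the checks this page actually runs, and it inspects only the repository root.

```python
from pathlib import Path


def repo_signals(repo_dir):
    """Return which of the four signals are visible in a local checkout.

    Heuristics (assumed, not the page's own rules):
      license: any root file whose name starts with LICENSE
      ci:      at least one YAML file under .github/workflows
      deps:    pyproject.toml, requirements.txt, or setup.py in the root
      docker:  a Dockerfile or docker-compose file in the root
    """
    root = Path(repo_dir)
    workflows = root / ".github" / "workflows"
    manifests = ("pyproject.toml", "requirements.txt", "setup.py")
    return {
        "license": any(root.glob("LICENSE*")),
        "ci": workflows.is_dir() and any(workflows.glob("*.y*ml")),
        "deps": any((root / name).is_file() for name in manifests),
        "docker": any(root.glob("Dockerfile*"))
        or any(root.glob("docker-compose*.y*ml")),
    }


if __name__ == "__main__":
    # Run from inside a cloned repository to inspect the current directory.
    for name, present in repo_signals(".").items():
        print(f"{name}: {'yes' if present else 'no'}")
```

A repository that keeps Docker files in a subdirectory would be reported as having no Docker setup by this sketch, so treat a negative result as a prompt to look further rather than as a final answer.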

Historical official implementation

Preserved for provenance. Not recommended as the default path for new builds.

opendatalab/PDF-Extract-Kit
Stars: 9,729
Last push: Jan 3, 2025

Reproduction readiness

Ready to Run
Time to first repro: hours
Last checked: Jun 17, 2026

Ready to reproduce

  • Clone opendatalab/mineru and install dependencies from pyproject.toml.
  • CI pipeline detected, which suggests automated checks are in place.
  • Last updated 1 day ago.

Quick start

git clone https://github.com/opendatalab/mineru.git
cd mineru
pip install -e .
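
After the editable install, it can help to confirm that the package is visible to the active Python environment. This is a minimal sketch using the standard library; the distribution name "mineru" is an assumption and should be checked against the `name` field in the repository's pyproject.toml.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "mineru" is an assumed distribution name; confirm it in pyproject.toml.
    print(installed_version("mineru") or "not installed in this environment")
```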

Additional implementations

Official

No additional official repositories detected.

Community

  • opendatalab/MinerU
    Confidence: Medium

    Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

    Stars: 67,882
    Last push: Jun 17, 2026
    License: NOASSERTION
  • A diffusion-based framework for document OCR that replaces autoregressive decoding with block-level parallel diffusion decoding.

    Stars: 602
    Last push: Apr 20, 2026
    License: MIT

These repositories had low-confidence matching signals and are hidden by default.

Hugging Face artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.

Datasets

Spaces

No trustworthy demo spaces right now.


Research context

Tasks

Open-source Solution Precise Document Content Extraction

Methods

None detected

Domains

None detected

