MinerU: High-Quality PDF to Markdown & JSON Converter

An open-source, high-quality tool to extract data from PDFs, converting them into structured Markdown and JSON formats. Streamline your document processing workflows.

Added on June 19, 2025
Stars: 35,415
Forks: 2,892
Language: Python

Project Introduction

Summary

MinerU is an advanced open-source project designed for robust and high-fidelity data extraction and document conversion from PDF files. It provides powerful capabilities to transform PDF content into flexible Markdown and structured JSON formats.

Problem Solved

Manually extracting data from PDFs or converting complex PDF documents is time-consuming and error-prone. MinerU automates this process and produces reliable structured output.

Core Features

High-Quality PDF to Markdown

Precisely convert complex PDF layouts into readable Markdown, preserving structure and formatting.

Structured Data Extraction to JSON

Extract structured data embedded within PDFs and output it as clean, parseable JSON.

All-in-One Data Extraction Solution

A comprehensive tool handling both document structure conversion and data extraction in one place.

Tech Stack

Python
PDF Parsing Libraries (e.g., pdfminer.six, PyMuPDF)
Markdown Generation
JSON Serialization
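
To make the Markdown generation and JSON serialization steps concrete, here is a minimal sketch that uses only the Python standard library. The block schema (`type`, `level`, `text`, `rows`) and the helper names are hypothetical stand-ins for whatever a PDF parser emits. They are not MinerU's actual output format or API.

```python
import json

# Hypothetical parsed layout blocks; this schema is an assumption,
# not MinerU's real intermediate format.
blocks = [
    {"type": "heading", "level": 1, "text": "Quarterly Report"},
    {"type": "paragraph", "text": "Revenue grew in Q2."},
    {"type": "table", "rows": [["Quarter", "Revenue"], ["Q1", "100"], ["Q2", "120"]]},
]


def block_to_markdown(block):
    """Render a single layout block as a Markdown fragment."""
    kind = block["type"]
    if kind == "heading":
        return "#" * block["level"] + " " + block["text"]
    if kind == "paragraph":
        return block["text"]
    if kind == "table":
        # First row is treated as the header, the rest as body rows.
        header, *body = block["rows"]
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)
    raise ValueError(f"unknown block type: {kind}")


def to_markdown(blocks):
    """Join rendered blocks with blank lines between them."""
    return "\n\n".join(block_to_markdown(b) for b in blocks)


def to_json(blocks):
    """Serialize the same blocks as JSON, keeping non-ASCII text readable."""
    return json.dumps(blocks, ensure_ascii=False, indent=2)


markdown_text = to_markdown(blocks)
json_text = to_json(blocks)
print(markdown_text)
```

Rendering both outputs from one intermediate block list keeps the Markdown and JSON views consistent with each other, which is one plausible way to structure a converter of this kind.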

Use Cases

MinerU suits workflows that need automated extraction and conversion of content from PDF documents, such as the following.

Financial Document Processing

Details

Automate the extraction of financial data, tables, and text from quarterly reports or invoices into a structured JSON format for database import or analysis.

User Value

Saves significant manual data entry time and improves data accuracy from financial reports.
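
As an illustration of the database-import step, the sketch below loads an extracted table into SQLite with the Python standard library. The JSON layout (`source`, `tables`, `rows`), the file name, and the `load_revenue` helper are assumptions made for this example. MinerU's real JSON output may be shaped differently.

```python
import json
import sqlite3

# Hypothetical extraction result; field names and values are illustrative only.
extracted_json = """
{
  "source": "q2_report.pdf",
  "tables": [
    {"name": "revenue", "rows": [
      {"quarter": "Q1", "amount": 100.0},
      {"quarter": "Q2", "amount": 120.5}
    ]}
  ]
}
"""


def load_revenue(conn, document):
    """Insert every row of the 'revenue' table into SQLite; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS revenue (source TEXT, quarter TEXT, amount REAL)"
    )
    rows = [
        (document["source"], row["quarter"], row["amount"])
        for table in document["tables"]
        if table["name"] == "revenue"
        for row in table["rows"]
    ]
    # Parameterized inserts avoid building SQL from extracted text.
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = load_revenue(conn, json.loads(extracted_json))
total = conn.execute("SELECT SUM(amount) FROM revenue").fetchone()[0]
print(inserted, total)
```

Keeping the source file name on each row makes it possible to trace a figure back to the report it came from when checking accuracy.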

Academic & Content Conversion

Details

Convert academic papers, e-books, or articles from PDF into Markdown format for easier reading, annotation, or publication on blogs and websites.

User Value

Facilitates the repurposing and sharing of information from academic sources or e-books.

Legal Document Data Extraction

Details

Extract specific clauses, names, dates, or figures from legal documents or contracts into JSON for searchable databases or compliance checks.

User Value

Enables efficient searching and analysis across large volumes of legal texts.
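
A rough sketch of the post-processing this use case implies: once text has been extracted from a contract, simple patterns can pull dates and monetary amounts into a JSON record. The sample text, the regular expressions, and the `extract_fields` helper are all hypothetical. Real contracts would need far more robust patterns, and none of this reflects MinerU's own API.

```python
import json
import re

# Hypothetical contract text, standing in for text already extracted from a PDF.
contract_text = (
    "This Agreement is made on 2024-03-15 between Acme Corp and Beta LLC. "
    "The term ends on 2026-03-14. The total fee is USD 12,500.00."
)

# ISO-style dates such as 2024-03-15.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# A three-letter currency code followed by an amount with optional
# thousands separators and optional cents.
AMOUNT_RE = re.compile(r"\b([A-Z]{3}) (\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")


def extract_fields(text):
    """Collect dates and monetary amounts from plain text into a dict."""
    dates = DATE_RE.findall(text)
    amounts = [
        {"currency": currency, "value": float(number.replace(",", ""))}
        for currency, number in AMOUNT_RE.findall(text)
    ]
    return {"dates": dates, "amounts": amounts}


record = extract_fields(contract_text)
record_json = json.dumps(record, indent=2)
print(record_json)
```

Records in this shape can be indexed by date or amount, which is what makes searching across many documents practical.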
