A robust, GPU-accelerated pipeline to convert PDF, Word, Excel, and PowerPoint documents into clean Markdown.
Uses Surya-OCR and Marker for high-fidelity PDF extraction (tables, equations, layout) and MarkItDown for Office documents. Includes both a modern GUI and a headless CLI for automation.
- gui_converter.py: (Recommended) The modern graphical interface. Features dark mode, folder selection, and non-freezing background processing.
- cli_converter.py: The headless script. Best for servers or automated batch jobs.
- cleanup.py: A maintenance tool to wipe the output folder and remove stray temporary files.
- OS: Windows 10/11
- GPU: NVIDIA GPU (Required for reasonable performance).
- Manager: Micromamba.
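A quick way to confirm the GPU requirement is met is to check whether the NVIDIA driver tools and a CUDA-enabled PyTorch are visible from Python. The following is a minimal sketch, not part of the project's scripts: the function name gpu_report is hypothetical, it assumes nvidia-smi is on PATH when a driver is installed, and the PyTorch part only reports something useful after PyTorch has been installed.

```python
import shutil
import subprocess


def gpu_report():
    """Collect basic facts about GPU visibility; works even without PyTorch."""
    report = {
        "nvidia_smi_found": False,
        "driver_query_ok": False,
        "torch_installed": False,
        "torch_cuda_available": False,
    }

    # Assumption: an installed NVIDIA driver puts nvidia-smi on PATH.
    smi = shutil.which("nvidia-smi")
    report["nvidia_smi_found"] = smi is not None
    if smi is not None:
        try:
            # "nvidia-smi -L" lists the GPUs the driver can see.
            result = subprocess.run(
                [smi, "-L"], capture_output=True, text=True, timeout=15
            )
            report["driver_query_ok"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass

    try:
        import torch  # only present after the PyTorch install step
    except (ImportError, OSError):
        return report

    report["torch_installed"] = True
    report["torch_cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If torch_installed is True but torch_cuda_available is False, the likely cause is a CPU-only PyTorch build rather than a missing GPU.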
Open Command Prompt (cmd) and run:
:: Create environment
micromamba create -n doc-convert python=3.10 -c conda-forge -y
:: Activate
micromamba activate doc-convert
Critical: You must install the CUDA version of PyTorch to use your GPU.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install customtkinter marker-pdf markitdown pymupdf pillow
Launch the graphical interface to select folders and watch the log in real time.
python gui_converter.py
- Click Select Input Folder (Put your PDFs/Docs here).
- Click Select Output Folder.
- Click START CONVERSION.
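The GUI stays responsive during conversion because the work runs in the background while the log updates. One common way to get that behaviour is a worker thread that reports through a queue, sketched below. This is an illustration under assumptions, not the actual code in gui_converter.py: run_in_background, drain, and the job callback are hypothetical names.

```python
import queue
import threading


def run_in_background(job, items, log_queue):
    """Run job(item) for each item on a worker thread; report via log_queue."""
    def worker():
        for item in items:
            try:
                job(item)
                log_queue.put(f"done: {item}")
            except Exception as exc:  # one bad file should not stop the batch
                log_queue.put(f"failed: {item} ({exc})")
        log_queue.put(None)  # sentinel: no more messages will follow

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def drain(log_queue):
    """Collect pending messages without blocking the caller."""
    messages = []
    finished = False
    while True:
        try:
            msg = log_queue.get_nowait()
        except queue.Empty:
            break
        if msg is None:
            finished = True
        else:
            messages.append(msg)
    return messages, finished
```

In a tkinter-based app, drain would typically be called on a timer (for example via the widget's after method) so that only the main thread touches the log widget.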
Run the headless script. Ensure you edit the INPUT_DIR and OUTPUT_DIR paths inside the script first.
python cli_converter.py
To clean up the /out directory and remove any temporary chunks left behind after a crash:
python cleanup.py
- "OSError: Page file too small": If the conversion crashes on huge files, ensure MAX_PAGES_PER_CHUNK is set to 25 or lower in the script.
- Stuck Logs: If the GUI logs look messy with progress bars, ensure you are using the latest version of gui_converter.py with the "Smart Filter" enabled.
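To see why a lower MAX_PAGES_PER_CHUNK helps with memory errors, it may help to picture how a large PDF gets split: a smaller chunk size means fewer pages are processed at once. The sketch below only illustrates the arithmetic. The helper chunk_page_ranges is hypothetical, and the way the actual scripts split documents may differ.

```python
MAX_PAGES_PER_CHUNK = 25  # value suggested in the troubleshooting note


def chunk_page_ranges(total_pages, chunk_size=MAX_PAGES_PER_CHUNK):
    """Return (start, end) page ranges: 0-based, end exclusive."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


if __name__ == "__main__":
    # A 60-page document is handled as three pieces: 25, 25, and 10 pages.
    print(chunk_page_ranges(60))
```

Halving the chunk size roughly doubles the number of chunks, so conversion may take longer, but each step needs less memory.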