troyscott/markforge

AI Document Converter (GPU-Accelerated)

A robust, GPU-accelerated pipeline to convert PDF, Word, Excel, and PowerPoint documents into clean Markdown.

It uses Surya-OCR and Marker for high-fidelity PDF extraction (tables, equations, layout) and MarkItDown for Office documents. It includes both a modern GUI and a headless CLI for automation.
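The split between the two extraction backends can be pictured as a dispatch on file extension. The sketch below is illustrative only: `pick_backend` and the extension sets are hypothetical names chosen for this example, not taken from the project's scripts, and the real list of supported extensions may differ.

```python
from pathlib import Path

# Hypothetical extension sets; the actual scripts may accept a different list.
PDF_EXTENSIONS = {".pdf"}
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def pick_backend(path: str) -> str:
    """Return which converter would handle the file, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "marker"
    if suffix in OFFICE_EXTENSIONS:
        return "markitdown"
    return "unsupported"
```

Matching on the lowercased suffix means files such as `REPORT.PDF` are routed the same way as `report.pdf`.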

📂 Project Structure

  • gui_converter.py: (Recommended) The modern graphical interface. Features dark mode, folder selection, and non-freezing background processing.
  • cli_converter.py: The headless script. Best for servers or automated batch jobs.
  • cleanup.py: A maintenance tool to wipe the output folder and remove stray temporary files.

📋 Prerequisites

  • OS: Windows 10/11
  • GPU: NVIDIA GPU (Required for reasonable performance).
  • Manager: Micromamba.

🛠️ Installation

1. Create Environment

Open Command Prompt (cmd) and run:

:: Create environment
micromamba create -n doc-convert python=3.10 -c conda-forge -y

:: Activate
micromamba activate doc-convert

2. Install PyTorch (CUDA)

Critical: You must install the CUDA version of PyTorch to use your GPU.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
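Before moving on, it can help to confirm that the installed PyTorch build actually sees the GPU. The following is a minimal check, assuming it is run inside the activated doc-convert environment; `cuda_status` is a hypothetical helper name, not part of this project.

```python
import importlib.util


def cuda_status() -> str:
    """Describe whether PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch  # imported lazily so the check also works without PyTorch

    if torch.cuda.is_available():
        return f"CUDA OK: {torch.cuda.get_device_name(0)}"
    return "PyTorch is installed but CUDA is not available (CPU-only build?)"


if __name__ == "__main__":
    print(cuda_status())
```

If the last message appears, the CPU-only wheel was probably installed; reinstalling with the CUDA index URL is the usual fix.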

3. Install Dependencies

pip install customtkinter marker-pdf markitdown pymupdf pillow

🚀 How to Run

Option A: The GUI (Easiest)

Launch the graphical interface to select folders and watch the log in real-time.

python gui_converter.py

  1. Click Select Input Folder (Put your PDFs/Docs here).
  2. Click Select Output Folder.
  3. Click START CONVERSION.

Option B: The CLI (Automated)

Run the headless script. Ensure you edit the INPUT_DIR and OUTPUT_DIR paths inside the script first.

python cli_converter.py

🧹 Maintenance

To clean up the /out directory and remove any temporary chunks left behind after a crash:

python cleanup.py

⚠️ Troubleshooting

  • "OSError: Page file too small": If the converter crashes on very large files, make sure MAX_PAGES_PER_CHUNK is set to 25 or lower in the script.
  • Stuck Logs: If the GUI log is cluttered with progress-bar output, make sure you are running the latest version of gui_converter.py with the "Smart Filter" enabled.
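The MAX_PAGES_PER_CHUNK setting suggests that large PDFs are processed in fixed-size page chunks to keep memory use bounded. How the script actually splits files is not shown in this README; the sketch below only illustrates the arithmetic, and `chunk_ranges` is a hypothetical helper name.

```python
MAX_PAGES_PER_CHUNK = 25  # value suggested in the troubleshooting note


def chunk_ranges(total_pages: int, chunk_size: int = MAX_PAGES_PER_CHUNK):
    """Split a page count into (start, end) ranges, with end exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


if __name__ == "__main__":
    # A 60-page document yields two full chunks and one 10-page remainder.
    print(chunk_ranges(60))
```

A smaller chunk size lowers peak memory per chunk at the cost of more chunks to process and merge.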
