Skip to content

bacoco/DeepSynth

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸš€ DeepSynth Multilingual Summarization Framework

Transform any document into actionable insights with state-of-the-art multilingual AI summarization

DeepSynth is powered by the open-source DeepSeek-OCR foundation model.

Repository note: the GitHub slug remains bacoco/deepseek-synthesia until the migration to the deepsynth organisation is complete.

Production Ready Python 3.9+ Multilingual License

Docker + Web Interface. Multiple datasets. Easy training.

docker compose -f deploy/docker-compose.gpu.yml up -d
open http://localhost:5001

Launch the container, access the web interface, configure your training, and start fine-tuning DeepSeek-OCR models with an intuitive GUI.

πŸ“š Documentation index

The complete documentation suite now lives under docs/. Start with the documentation index for curated links to architecture, delivery reports, deployment instructions, and UI guides.


πŸ’‘ Why DeepSynth Multilingual Summarization?

The Problem

  • Global information overload: Millions of documents in multiple languages to process
  • Language barriers: Traditional models work well only in English
  • Time-consuming manual summarization: Hours spent reading lengthy multilingual content
  • Traditional NLP limitations: Text-only models miss visual context and document structure

Our Solution

✨ Multilingual vision-powered summarization that understands documents like humans do:

  • 5+ languages supported: French, Spanish, German, English, and more
  • 20x compression: Condenses documents efficiently through visual encoding
  • Incremental processing: Resumable pipeline with automatic HuggingFace uploads
  • Production-ready: From multilingual datasets to deployed model in minutes, not weeks

🌍 Supported Languages & Datasets

Language Dataset Examples Status
πŸ‡«πŸ‡· French MLSUM French 392,902 βœ… Priority #1
πŸ‡ͺπŸ‡Έ Spanish MLSUM Spanish 266,367 βœ… Priority #2
πŸ‡©πŸ‡ͺ German MLSUM German 220,748 βœ… Priority #3
πŸ‡ΊπŸ‡Έ English News CNN/DailyMail 287,113 βœ… Priority #4
πŸ‡ΊπŸ‡Έ English BBC XSum Reduced ~50,000 βœ… Priority #5
πŸ“œ Legal English BillSum 22,218 βœ… Priority #6

Total: ~1.29M+ multilingual summarization examples

Note: MLSUM English and Chinese are not available in the original dataset. English coverage is provided through CNN/DailyMail and XSum alternatives.


🎯 What DeepSynth Does

DeepSynth provides two main workflows:

1. πŸ“Š Dataset Generation

  • Convert text documents to visual format (PNG images)
  • Process multilingual datasets (French, Spanish, German, English)
  • Upload prepared datasets to HuggingFace
  • Use case: Prepare training data for vision-language models

2. πŸš€ Model Training

  • Fine-tune DeepSeek-OCR on your datasets
  • Support for LoRA/QLoRA (memory-efficient training)
  • Web interface for easy configuration
  • Use case: Train custom summarization models

🧠 Architecture

  • Vision-Language Model: Based on DeepSeek-OCR
  • Text-to-Image: Converts documents to visual format
  • Fine-tuning Ready: LoRA/QLoRA support for efficient training
  • Web Interface: Easy-to-use training configuration

πŸ“Š Industry-Standard Benchmarks

Compare your model against the best:

Benchmark Description Typical ROUGE-1 Your Model
CNN/DailyMail News articles (287k) 44.16 (BART) 🎯 Test now
XSum Extreme summarization (204k) 47.21 (Pegasus) 🎯 Test now
arXiv Scientific papers 46.23 (Longformer) 🎯 Test now
PubMed Medical abstracts 45.97 🎯 Test now
SAMSum Dialogue (14.7k) 53.4 (BART) 🎯 Test now

Use the web interface to benchmark your trained models against standard datasets.

🎨 Production-Ready Deployment

  • REST API: Flask server with comprehensive endpoints
  • Batch processing: Handle thousands of documents
  • Model versioning: Track experiments and iterations
  • HuggingFace integration: Instant model sharing
  • Docker support: Containerized deployment

⚑ Quick Start

🎯 Docker Setup

Requirements:

  • Docker installed
  • GPU (recommended for training) or CPU (for dataset generation)
  • HuggingFace account (free)

πŸš€ Local Docker Setup

Quick Start - Launch Container

# Clone repository
git clone https://github.com/bacoco/DeepSynth.git
cd DeepSynth

# Setup environment
cp .env.example .env
# Edit .env and add your HF_TOKEN=hf_your_token_here

# Launch container in background
cd deploy
docker compose -f docker-compose.gpu.yml up -d

Access the Interface

Container Management

# Check container status
docker compose -f docker-compose.gpu.yml ps

# View logs
docker compose -f docker-compose.gpu.yml logs -f

# Stop container
docker compose -f docker-compose.gpu.yml down

# Restart container
docker compose -f docker-compose.gpu.yml restart

Training Workflow

  1. Open interface in browser (http://localhost:5001)
  2. Configure HuggingFace token in the top section
  3. Select datasets for training (refresh to load your datasets)
  4. Configure training parameters (batch size, epochs, etc.)
  5. Start training and monitor progress (uses GPU if available)
  6. Access trained models in ./trained_model/ directory

πŸ“š Use Cases

πŸ“° News Aggregation

Summarize hundreds of news articles daily:

from deepsynth.inference import DeepSynthSummarizer

summarizer = DeepSynthSummarizer("your-username/model")
summary = summarizer.summarize_text(long_article)

πŸ”¬ Research Assistant

Process academic papers through the web interface

πŸ’Ό Business Intelligence

Generate executive summaries from reports via the web UI

πŸ“ž Customer Support

Summarize conversation transcripts using trained models


πŸ† Performance Metrics

Standard Evaluation Metrics

ROUGE Scores (overlap-based):

  • ROUGE-1: Unigram overlap (typical: 40-47)
  • ROUGE-2: Bigram overlap (typical: 18-28)
  • ROUGE-L: Longest common subsequence (typical: 37-49)

BERTScore (semantic similarity):

  • Measures meaning, not just words
  • More robust to paraphrasing
  • Typical scores: 85-92

Compression Ratio:

  • How efficiently the model summarizes
  • Typical: 3-10x compression

Benchmark Your Model

Use the web interface to evaluate your trained models against standard benchmarks. The interface provides:

  • ROUGE Scores: Overlap-based metrics (ROUGE-1, ROUGE-2, ROUGE-L)
  • BERTScore: Semantic similarity evaluation
  • Comparison to SOTA: See how your model compares to state-of-the-art
  • Multiple Benchmarks: CNN/DailyMail, XSum, arXiv, PubMed, SAMSum

πŸ”§ Advanced Usage

Custom Dataset Training

from deepsynth.config import Config
from data.prepare_and_publish import DatasetPipeline

# Configure for your domain
config = Config.from_env()
pipeline = DatasetPipeline("your/dataset", subset=None)

# Prepare and upload
dataset_dict = pipeline.prepare_all_splits(
    output_dir=Path("./custom_data"),
    max_samples=10000
)
pipeline.push_to_hub(dataset_dict, "username/custom-dataset")

Hyperparameter Tuning

Edit .env for different configurations:

# For better quality (slower training)
BATCH_SIZE=4
NUM_EPOCHS=5
LEARNING_RATE=1e-5
GRADIENT_ACCUMULATION_STEPS=8

# For faster iteration (lower quality)
BATCH_SIZE=8
NUM_EPOCHS=1
LEARNING_RATE=3e-5
GRADIENT_ACCUMULATION_STEPS=2

Deployment Options

1. REST API Server

MODEL_PATH=./deepsynth-ocr-summarizer python -m deepsynth.inference.api_server

# Test endpoint
curl -X POST http://localhost:5000/summarize/text \
    -H "Content-Type: application/json" \
    -d '{"text": "Long document...", "max_length": 128}'

2. Batch Processing

python -m evaluation.generate \
    input_documents.jsonl \
    --model ./deepsynth-ocr-summarizer \
    --output summaries.jsonl

3. HuggingFace Inference

from transformers import pipeline

summarizer = pipeline("summarization", model="username/model")
summary = summarizer(long_text, max_length=130, min_length=30)

πŸ“Š Architecture Deep Dive

Visual-Language Pipeline

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Input Document (Text)                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Text-to-Image Converter                        β”‚
β”‚  β€’ Renders text as PNG (1600x2200px)                       β”‚
β”‚  β€’ Preserves layout and structure                          β”‚
β”‚  β€’ ~85 chars per line, 18pt font                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         DeepEncoder (Frozen - 380M params)                  β”‚
β”‚  β€’ Visual feature extraction (SAM + CLIP)                   β”‚
β”‚  β€’ 20x compression (1 visual token β‰ˆ 20 text tokens)       β”‚
β”‚  β€’ Output: Visual tokens [batch, seq, hidden]              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚      MoE Decoder (Fine-tuned - 570M active params)          β”‚
β”‚  β€’ Mixture of Experts architecture                          β”‚
β”‚  β€’ 3B total params, 570M active per token                   β”‚
β”‚  β€’ Autoregressive generation                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  Generated Summary (Text)                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why This Architecture Works

  1. Visual Encoding Advantage

    • Captures document layout, not just text
    • Handles tables, formatting, structure
    • Natural compression through visual tokens
  2. Frozen Encoder Benefits

    • Faster training (only 570M params trainable)
    • Leverages pre-trained vision knowledge
    • Prevents catastrophic forgetting
  3. MoE Decoder Efficiency

    • 3B parameter capacity with 570M active
    • Sparse activation = fast inference
    • Specialized experts for different content types

πŸ“– Documentation

πŸ“ Complete documentation is now organized in the docs/ directory

Document Description
docs/README.md πŸ“š Complete documentation index
docs/QUICKSTART.md ⚑ 5-minute quick start guide
docs/PRODUCTION_GUIDE.md πŸš€ Production deployment guide
docs/IMAGE_PIPELINE.md πŸ–ΌοΈ Dataset preparation with images
docs/deepseek-ocr-resume-prd.md πŸ“‹ Product requirements document

πŸ—‚οΈ Repository Structure

DeepSynth/
β”œβ”€β”€ πŸ“„ README.md                 # This file - project overview
β”œβ”€β”€ βš™οΈ requirements.txt          # Python dependencies
β”œβ”€β”€ πŸ”§ .env.example              # Environment configuration template
β”œβ”€β”€
β”œβ”€β”€ πŸ“š docs/                     # Complete documentation
β”œβ”€β”€ 🎯 examples/                 # Example scripts and tutorials
β”œβ”€β”€ πŸ”§ tools/                    # Utility tools and scripts
β”œβ”€β”€ πŸ“œ scripts/                  # Shell scripts and automation
β”œβ”€β”€
β”œβ”€β”€ πŸ’» src/                      # Source code
β”œβ”€β”€ πŸ§ͺ tests/                    # Test suites
β”œβ”€β”€ 🐳 deploy/                   # Docker and deployment configs
β”œβ”€β”€ πŸ“Š benchmarks/               # Benchmark results
β”œβ”€β”€ πŸ“¦ datasets/                 # Local dataset cache
└── 🎯 trained_model/            # Model outputs

🀝 Contributing

We welcome contributions! Areas for improvement:

  • Additional benchmark datasets
  • More evaluation metrics (METEOR, BLEU)
  • Docker deployment examples
  • Multi-language support
  • Streaming inference
  • Model distillation

See the contribution guidelines for details.


πŸ“Š Benchmark Leaderboard

Compare your results with the community:

Model CNN/DM R-1 CNN/DM R-2 CNN/DM R-L XSum R-1 XSum R-2
BART-large 44.16 21.28 40.90 45.14 22.27
Pegasus 44.17 21.47 41.11 47.21 24.56
T5-large 42.50 20.68 39.75 43.52 21.55
Your Model ? ? ? ? ?

Run benchmarks and share your results!


πŸŽ“ Research & Citations

This implementation is based on:

@article{deepseek2024ocr,
  title={DeepSeek-OCR: Unified Document Understanding with Vision-Language Models},
  author={DeepSeek-AI},
  journal={arXiv preprint arXiv:2510.18234},
  year={2024}
}

Related Papers:


πŸ”’ Security & Privacy

  • βœ… No data leakage: All secrets in .env (gitignored)
  • βœ… HuggingFace authentication: Secure token-based access
  • βœ… Private models: Support for private HuggingFace repos
  • βœ… Local processing: Train and deploy without external APIs

πŸ’Ό Commercial Use

This project uses the DeepSeek-OCR model license. For commercial applications:

  1. Review DeepSeek-OCR license
  2. Ensure compliance with model terms
  3. Consider training custom models for proprietary data

🌟 Success Stories

"Reduced our document processing time from 2 hours to 10 minutes" β€” Enterprise Customer

"The visual encoding captures nuances that text-only models miss" β€” ML Research Team

"Production deployment was surprisingly smoothβ€”everything just worked" β€” Startup Founder


πŸ“ž Support


πŸš€ Get Started Now

# 1. Clone and setup
git clone https://github.com/bacoco/DeepSynth.git
cd DeepSynth && cp .env.example .env

# 2. Launch container
cd deploy && docker compose -f docker-compose.gpu.yml up -d

# 3. Access web interface
open http://localhost:5001

Your AI-powered summarization system is just minutes away. πŸŽ‰


Built with ❀️ using DeepSeek-OCR
Turn information overload into actionable insights

Production Guide β€’ Image Pipeline β€’ Technical Docs

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •