Manual

NAME

pdfsandwich - A generator for sandwich OCR pdfs from scanned pdf files
SYNOPSIS
pdfsandwich [options] inputfile.pdf
DESCRIPTION
pdfsandwich generates "sandwich" OCR pdf files: pdf files which contain only
images (no text) are processed by optical character recognition (OCR), and the
recognized text is added to each page invisibly "behind" the images.
Note that pdfsandwich needs the following programs: unpaper, convert, gs,
hocr2pdf (for tesseract < 3.03), and tesseract.
As tesseract >= 3.03 can write pdf files, hocr2pdf is only needed for older
versions of tesseract.
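The dependency rule above can be checked before a run. The following Python
sketch is an illustration only and is not part of pdfsandwich: the function
names are hypothetical, and the assumed version banner format ("tesseract
3.02.02") may not match every tesseract build.

```python
import re
import shutil
import subprocess

# Programs named in the description; hocr2pdf is added conditionally below.
REQUIRED = ["unpaper", "convert", "gs", "tesseract"]


def parse_tesseract_version(banner):
    """Extract (major, minor) from text such as 'tesseract 3.02.02'.

    The banner format is an assumption; returns None if nothing matches.
    """
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", banner)
    return (int(match.group(1)), int(match.group(2))) if match else None


def needs_hocr2pdf(version):
    """hocr2pdf is only needed for tesseract older than 3.03."""
    return version < (3, 3)


def missing_programs(version, which=shutil.which):
    """Return the required programs that are not found on PATH."""
    names = list(REQUIRED)
    if needs_hocr2pdf(version):
        names.append("hocr2pdf")
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    version = None
    if shutil.which("tesseract"):
        # Some tesseract versions print the banner to stderr, so read both.
        result = subprocess.run(["tesseract", "--version"],
                                capture_output=True, text=True)
        version = parse_tesseract_version(result.stdout + result.stderr)
    if version is None:
        print("tesseract not found or version not recognized")
    else:
        print("missing programs:", missing_programs(version) or "none")
```

The `which` parameter exists only so the lookup can be replaced in tests; by
default the real PATH is searched.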
Please visit http://www.tobias-elze.de/pdfsandwich.
OPTIONS
-convert filename
    name of convert binary (default: convert)
-coo options
    additional convert options; make sure to quote,
    e.g. -coo "-normalize -black-threshold 75%";
    call convert --help or man convert for all convert options
-debug
    keep all temporary files in /tmp (for debugging)
-enforcehocr2pdf
    use hocr2pdf even if tesseract >= 3.03
-first_page number
    number of page to start OCR from (default: 1)
-gray
    use grayscale for images (default: black and white)
-grayfilter
    enable unpaper's gray filter; further options can be set by -unpo
-gs filename
    name of gs binary (default: gs); optional, only required for resizing
-hocr2pdf filename
    name of hocr2pdf binary (default: hocr2pdf);
    ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
-hoo options
    additional hocr2pdf options; make sure to quote
-identify filename
    name of identify binary (default: identify)
-last_page number
    number of page up to which to process OCR
    (default: number of pages in inputfile)
-lang language
    language of the text; option to tesseract (default: eng),
    e.g. eng, deu, deu-frak, fra, rus, swe, spa, ita, ...;
    see option -list_langs.
    Multiple languages may be specified, separated by plus characters.
-layout { single | double | none }
    layout of the scanned pages; requires unpaper
    single: one page per sheet
    double: two pages per sheet
    none: no auto-layout (default)
-list_langs
    list currently available languages and exit;
    in case of custom binaries of tesseract, place this after the
    -tesseract option
-maxpixels NUM
    maximal number of pixels allowed for input file;
    if (resolution/72)^2 * width * height > maxpixels, then the page of the
    input file is scaled down prior to OCR so that the page size in pixels
    corresponds to maxpixels;
    default: 17415167 (A3 @ 300 dpi)
-noimage
    do not place the image over the text (requires hocr2pdf; ignored
    without -enforcehocr2pdf option)
-nopreproc
    do not preprocess with unpaper
-nthreads number
    number of parallel threads (default: guessed number of CPUs;
    if guessing fails: 1)
-o filename
    output file; default: inputfile_ocr.pdf (if extension is different
    from .pdf, original extension is kept)
-omp_thread_limit number
    number of threads tesseract may use for each page (default: 1)
-pagesize { original | NUMxNUM }
    set page size of output pdf (requires ghostscript)
    original: same as input file (default)
    NUMxNUM: width x height in pixels (e.g. for A4: -pagesize 595x842)
-pdfinfo filename
    name of pdfinfo binary (default: pdfinfo)
-pdfunite filename
    name of pdfunite binary (default: pdfunite)
-resolution NUM
    resolution (dpi) used for OCR (default: 300)
-rgb
    use RGB color space for images (default: black and white);
    use with care: causes problems with some color spaces
-sloppy_text
    sloppily place text, group words, do not draw single glyphs;
    ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
-tesseract filename
    name of tesseract binary (default: tesseract)
-tesso options
    additional tesseract options; make sure to quote
-unpaper filename
    name of unpaper binary (default: unpaper)
-unpo options
    additional unpaper options; make sure to quote
-quiet
    suppress output
-verbose
    produce more output
-version
    print version and quit
-help
    display this list of options
--help
    display this list of options
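The -maxpixels condition can be worked through numerically. The Python sketch
below evaluates the stated formula, (resolution/72)^2 * width * height, under
the assumption that width and height are given in points (1/72 inch). The
function names are hypothetical, and the square-root scale factor is one way
to read "scale down so that the page size in pixels corresponds to
maxpixels"; it is not taken from the pdfsandwich sources.

```python
import math

DEFAULT_MAXPIXELS = 17415167  # default from the option list (A3 @ 300 dpi)


def page_pixels(width_pt, height_pt, resolution):
    """Pixel count of a page: (resolution/72)^2 * width * height."""
    return (resolution / 72) ** 2 * width_pt * height_pt


def needs_downscale(width_pt, height_pt, resolution,
                    maxpixels=DEFAULT_MAXPIXELS):
    """True if the page exceeds maxpixels at the given resolution."""
    return page_pixels(width_pt, height_pt, resolution) > maxpixels


def linear_scale(width_pt, height_pt, resolution,
                 maxpixels=DEFAULT_MAXPIXELS):
    """Assumed linear factor that brings the pixel count down to maxpixels.

    The pixel count grows with the square of the linear size, hence the
    square root. Returns 1.0 when no scaling is needed.
    """
    pixels = page_pixels(width_pt, height_pt, resolution)
    return 1.0 if pixels <= maxpixels else math.sqrt(maxpixels / pixels)


if __name__ == "__main__":
    # A4 (595x842) at the default 300 dpi stays below the default limit.
    print(needs_downscale(595, 842, 300))
    # A3 (842x1191) at 600 dpi exceeds it and would be scaled down.
    print(needs_downscale(842, 1191, 600), linear_scale(842, 1191, 600))
```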

LANGUAGES
Via tesseract, numerous language packages are available; see
http://code.google.com/p/tesseract-ocr/downloads/list for a complete list. Here is
an incomplete selection of supported languages and their abbreviations:

ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces (Czech),
chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), chr (Cherokee),
dan (Danish), dan-frak (Danish (Fraktur)), deu (German), ell (Greek),
eng (English), enm (Middle English), epo (Esperanto), est (Estonian),
fin (Finnish), fra (French), frm (Middle French), glg (Galician), heb (Hebrew),
hin (Hindi), hrv (Croatian), hun (Hungarian), ind (Indonesian), ita (Italian),
jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch),
nor (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus (Russian),
slk (Slovakian), slv (Slovenian), sqi (Albanian), spa (Spanish), srp (Serbian),
swe (Swedish), tam (Tamil), tel (Telugu), tgl (Tagalog), tha (Thai),
tur (Turkish), ukr (Ukrainian), vie (Vietnamese)

Multiple languages may be specified, separated by plus characters. Note that the
respective tesseract language package needs to be installed on your system to be
usable by pdfsandwich. Option -list_langs lists the languages which are available
on your system.
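As an illustration of the plus-separated -lang syntax, the following Python
sketch assembles a pdfsandwich command line. The helper names and the file
name scan.pdf are hypothetical. The set of installed languages is passed in
by the caller, since the output format of -list_langs is not specified here.

```python
import shlex


def lang_argument(languages):
    """Join tesseract language codes with plus characters, e.g. 'deu+eng'."""
    if not languages:
        raise ValueError("at least one language code is required")
    return "+".join(languages)


def build_command(inputfile, languages=("eng",), extra=()):
    """Argument list following the synopsis: pdfsandwich [options] inputfile."""
    return ["pdfsandwich", "-lang", lang_argument(languages), *extra, inputfile]


def unavailable(requested, installed):
    """Requested language codes that are not among the installed ones."""
    return [code for code in requested if code not in installed]


if __name__ == "__main__":
    command = build_command("scan.pdf", ["deu", "eng"], extra=["-nthreads", "2"])
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(command))
```

Passing the command as a list (for example to subprocess.run) avoids shell
quoting problems with options such as -coo, whose value contains spaces.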
AVAILABILITY
Sources and packages as well as comprehensive help can be found at
http://www.tobias-elze.de/pdfsandwich.
AUTHOR
Tobias Elze <[email protected]>
