Manual
Manual
pdfsandwich - A generator for sandwich OCR pdfs from scanned pdf files
SYNOPSIS
pdfsandwich [options] inputfile.pdf
DESCRIPTION
pdfsandwich generates "sandwich" OCR pdf files, i.e. pdf files which contain only
images (no text) will be processed by optical character recognition (OCR) and the
text will be added to each page invisibly "behind" the images.
Note that pdfsandwich needs the following programs: unpaper, convert, gs,
hocr2pdf (for tesseract < 3.03), and tesseract.
As tesseract >= 3.03 can write pdf files, hocr2pdf is only needed for older
versions of tesseract.
Please visit http://www.tobias-elze.de/pdfsandwich.
OPTIONS
-convert -convert filename : name of convert binary (default: convert)
-coo -coo options : additional convert options; make sure to quote;
e.g. -coo "-normalize -black-threshold 75%"
call convert --help or man convert for all convert options
-debug keep all temporary files in /tmp (for debugging)
-enforcehocr2pdf use hocr2pdf even if tesseract >= 3.03
-first_page -first_page number : number of page to start OCR from (default:
1)
-gray use grayscale for images (default: black and white)
-grayfilter enable unpaper's gray filter; further options can be set by -
unpo
-gs -gs filename : name of gs binary (default: gs); optional, only
required for resizing
-hocr2pdf -hocr2pdf filename : name of hocr2pdf binary (default:
hocr2pdf);
ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
-hoo -hoo options : additional hocr2pdf options; make sure to quote
-identify -identify filename : name of identify binary (default: identify)
-last_page -last_page number : number of page up to which to process OCR
(default: number of pages in inputfile)
-lang -lang language : language of the text; option to tesseract (default:
eng)
e.g: eng, deu, deu-frak, fra, rus, swe, spa, ita, ...
see option -list_langs;
Multiple languages may be specified, separated by plus characters.
-layout -layout { single | double | none } : layout of the scanned pages;
requires unpaper
single: one page per sheet
double: two pages per sheet
none: no auto-layout (default)
-list_langs list currently available languages and exit;
in case of custom binaries of tesseract, place this after the -
tesseract option
-maxpixels -maxpixels NUM : maximal number of pixels allowed for input file
if (resolution/72)^2 *width*height > maxpixels then scale page of
input file down
prior to OCR so that page size in pixels corresponds to maxpixels;
default: 17415167 (A3 @ 300 dpi)
-noimage do not place the image over the text (requires hocr2pdf; ignored
without -enforcehocr2pdf option)
-nopreproc do not preprocess with unpaper
-nthreads -nthreads number : number of parallel threads (default: guessed
number of CPUs; if guessing fails: 1)
-o -o filename : output file; default: inputfile_ocr.pdf (if extension is
different
from .pdf, original extension is kept)
-omp_thread_limit -omp_thread_limit number : number of threads tesseract may
use for each page (default: 1)
-pagesize -pagesize { original | NUMxNUM } : set page size of output pdf
(requires ghostscript)
original: same as input file (default)
NUMxNUM: width x height in pixel (e.g. for A4: -pagesize 595x842)
-pdfinfo -pdfinfo filename : name of pdfinfo binary (default: pdfinfo)
-pdfunite -pdfunite filename : name of pdfunite binary (default: pdfunite)
-resolution -resolution NUM : resolution (dpi) used for OCR (default: 300)
-rgb use RGB color space for images (default: black and white);
use with care: causes problems with some color spaces
-sloppy_text sloppily place text, group words, do not draw single glyphs;
ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
-tesseract -tesseract filename : name of tesseract binary (default:
tesseract)
-tesso -tesso options : additional tesseract options; make sure to quote
-unpaper -unpaper filename : name of unpaper binary (default: unpaper)
-unpo -unpo options : additional unpaper options; make sure to quote
-quiet suppress output
-verbose produce more output
-version print version and quit
-help Display this list of options
--help Display this list of options
LANGUAGES
Via Tesseract, numerous language packagess available - follow this link
http://code.google.com/p/tesseract-ocr/downloads/list for a complete list. Here is
an incomplete selection of supported languages and their abbreviations:
ara (Arabic), aze (Azerbauijani), bul (Bulgarian), cat (Catalan), ces (Czech),
chi_sim (Simplified Chinese),
chi_tra (Traditional Chinese), chr (Cherokee), dan (Danish), dan-frak (Danish
(Fraktur)), deu (German), ell
(Greek), eng (English), enm (Old English), epo (Esperanto), est (Estonian), fin
(Finnish), fra (French), frm (Old
French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croation), hun
(Hungarian), ind (Indonesian), ita
(Italian), jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian), nld
(Dutch), nor (Norwegian), pol
(Polish), por (Portuguese), ron (Romanian), rus (Russian), slk (Slovakian), slv
(Slovenian), sqi (Albanian), spa
(Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl
(Tagalog), tha (Thai), tur (Turkish), ukr
(Ukrainian), vie (Vietnamese)
Multiple languages may be specified, separated by plus characters. Note that the
respective tesseract language package needs to be installed on your system to be
usable by pdfsandwich. Option -list_langs lists the languages which are available
on your system.
AVAILABILITY
Sources and packages as well as comprehensive help can be found at
http://www.tobias-elze.de/pdfsandwich.
AUTHOR
Tobias Elze <[email protected]>