Scan and Share - Tutorial On Making Ebooks
Tutorial on making e-books
written by V. and A.
2010
Contents
1 Introduction
1.1 In brief
2 Scanning a book
1 Introduction
This is a mini-tutorial about scanning books and making high-quality files
out of them. This tutorial is intended for people who would like to make good-quality electronic books but do not know where to start. There are many
ways to get good results by scanning; this text shows you some reasonably
easy ways. The tutorial has step-by-step screenshots and assumes some familiarity with Windows. You may need to download and install a few programs
(see Appendix A).
1.1 In brief
For the impatient reader: The process consists roughly of the following stages:
1. Scan every page in 300dpi greyscale, save to TIF. Save a backup of your
scans!
2. Import images into ScanKromsator or ScanTailor, process images. Save
a backup of the processed images at this stage!
3. Create a DJVU file out of processed images.
4. Add OCR and/or bookmarks to the DJVU file.
(It is most important to master stages 1 and 2, since the processed images
after stage 2 are much smaller than the initial scans, and you can send them
to somebody else if you have trouble with stages 3 and 4.)
If you don't know what 600dpi means: it is the resolution of the image, that is,
the number of image points (pixels) per inch (dpi = dots per inch).
black/white.2 If the book has a few pages with color illustrations, you will
need to scan them separately in 300dpi 24-bit color mode. The same applies
to colorful book covers that you also may want to scan.
Please note:
Never scan at 300dpi black/white! The quality of the results is never as
good as what you can get by scanning in 300dpi greyscale and following
this tutorial or equivalent methods.
Scanning in 300dpi greyscale is on most scanners exactly as quick as
scanning in 300dpi black/white or in any lower resolution! You will
not save time if you scan in 300dpi black/white or in 200dpi instead of
300dpi greyscale, but you do lose a lot of quality.
Scanning in 300dpi greyscale produces large intermediate scanned files,
which will be processed into very small DJVU files. Scanning in 600dpi
black/white produces smaller intermediate scanned files, but the process of scanning at 600dpi is much slower for most scanners. Also, it's
easier to process 300dpi greyscale scans because they have less "digital
dirt" than 600dpi black/white scans.
It is nearly impossible to improve the quality of a poorly scanned and/or
incorrectly processed image of a book. For example, some e-books are
made by inexperienced people in 150dpi, or in color instead of black/white,
or the resolution was decreased after scanning in an attempt to reduce
the file size. These e-book files are huge in size. The visual and print
quality of such e-books is bad and cannot be improved! It is important
(and not difficult) to make the scanned image correctly and ensure great
quality of the resulting e-books. Read on!
A high-quality scanned e-book is small in size, looks great on the screen
and in print, and has searchable text. There are
many ways to achieve high quality of scanned e-books; all methods involve the
resolution of 600dpi. (Higher resolution almost never brings significantly
better quality.) Output files are in the DJVU format (see footnote 3) and typically take about
5KB to 10KB per page. If your file is significantly larger, while the book
contains only black/white text and is printed reasonably clearly, something
was done incorrectly when producing the file.
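As a rough sanity check of the size guideline above, you can compute the average size per page of a finished file. This is only a minimal sketch: the function names are made up for this example, and the 10KB limit is simply the upper figure quoted above, not a hard rule.

```python
def kb_per_page(file_size_bytes: int, page_count: int) -> float:
    """Average size of one page in kilobytes (1 KB = 1024 bytes)."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return file_size_bytes / 1024 / page_count


def looks_too_large(file_size_bytes: int, page_count: int,
                    limit_kb: float = 10.0) -> bool:
    """True if the average page is larger than limit_kb (an assumed cut-off)."""
    return kb_per_page(file_size_bytes, page_count) > limit_kb


# Example: a 300-page black/white book stored in a 2,400,000-byte file.
print(round(kb_per_page(2_400_000, 300), 1))   # prints 7.8
print(looks_too_large(2_400_000, 300))         # prints False
```

A result far above the limit for a plain black/white text book suggests that something went wrong at an earlier stage, for example color or greyscale pages that were not converted.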
You may of course experiment on your own with other programs. For example, some people use Photoshop with special plugins, Book Restorer, Corel
PhotoPaint, RasterID, even Matlab and IDL for image processing. This tutorial
presents a particular method that practically guarantees good results. If you
are a beginner, please make a few books by closely following the instructions
in this tutorial. You will then see that you can achieve quite a high level of
quality without excessive effort and without learning too many technicalities.
If you develop your own methods, for example by using different options or
different programs, you will be able to decide which method is best because
you can then compare the quality of the results with the reference quality
obtained by the methods in this tutorial.
2
This kind of processing, in which the resolution of an image is increased, is called upsampling.
3
If you don't know what DJVU is, please use Google or Wikipedia to read about it. The
DJVU format was specially developed for high-compression storage of scanned images. Most
e-books today are in the PDF format, but the PDF format was intended for documents created
in a word processor, i.e. for vector documents rather than scanned documents. Scanned e-books in PDF format occupy more space and/or display more slowly than in the DJVU format.
2 Scanning a book
You pick up a thick volume. Maybe you think that only a maniac could scan
it, page after page. Yes, you are right! But you can become that kind of
maniac and scan books of any size without much discomfort if you organize
your work well.
For the impatient reader:
use any flatbed scanner, even a cheap one, and a program such as
IrfanView to control scanning
do not use a digital camera for scanning books!
do not use FineReader for scanning books!
Why not use FineReader for scanning? FineReader is a good program
for OCR but is not optimal for scanning and for processing the scans
with the goal of making a scanned e-book. FineReader attempts to give you
a kind of all-in-one solution for scanning and processing e-books; please resist the temptation to use just one program for everything. You will not get
good results with FineReader; in any case, nowhere near as good as when you
follow this tutorial. FineReader has the following technical drawbacks:
1) It sometimes uses JPEG for image compression. This is not appropriate for
black/white text!
2) It stores images internally as black/white 300dpi TIFFs
and auto-rotates them. Black/white 300dpi is adequate for OCR but not optimal for digital scanned e-books. The auto-rotate algorithm is faulty and
produces defects in the image (broken lines). The auto-rotation is hard-coded into FineReader 7.x and 8.x and cannot be disabled.4
3) If you scan in
300dpi greyscale, which is the procedure recommended here, FineReader will
perform all operations at 300dpi, rather than resample to 600dpi. ScanKromsator and ScanTailor will first resample to 600dpi and then perform processing.
The results of FineReader processing are always going to be inferior for
these reasons.
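To illustrate what resampling from 300dpi to 600dpi means, here is a minimal sketch that doubles the number of pixels in one row of a greyscale image by linear interpolation. It is only an illustration and not the algorithm used by ScanKromsator or ScanTailor; the function name is made up for this example.

```python
def upsample_row_2x(row: list[int]) -> list[float]:
    """Double the number of pixels in a row using linear interpolation."""
    out = []
    for i, value in enumerate(row):
        # Neighbour to the right; the last pixel is simply repeated.
        nxt = row[i + 1] if i + 1 < len(row) else value
        out.append(float(value))        # the original pixel
        out.append((value + nxt) / 2)   # a new pixel halfway to the neighbour
    return out


# 0 = black ink, 255 = white paper; the edge of a letter lies in the middle.
print(upsample_row_2x([0, 0, 255, 255]))
# prints [0.0, 0.0, 0.0, 127.5, 255.0, 255.0, 255.0, 255.0]
```

The new pixel at the edge of the letter gets an intermediate grey value. A later conversion to black/white can use such values to place the edge more finely than the original 300dpi grid allows, which is one reason to start from greyscale scans.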
Why not use a digital camera for scanning books? You will never get good
results, even with expensive 10 Megapixel (or whatever) cameras. The results are never
even close to those of a flatbed scanner, even a cheap one. Look at figure 1
below and guess which of the two images of the same page is made by a digital
camera.
4
An option to disable this auto-rotation was added only in FineReader version 9.
However, the other drawbacks of FineReader remain. Also note that FineReader version 9 cannot
be used to produce an OCR layer in DJVU files. I recommend using FineReader version 8.
Figure 1: Two images of the same page, one made by a digital camera, another by a cheap flatbed scanner. The image made by a flatbed scanner was
scanned at 300dpi greyscale and upsampled to 600dpi black/white. You can
guess which image was made by the digital camera! (Yes, the crappy one.)
We recommend that you always use a flatbed scanner and scan at 300dpi
greyscale or higher resolution.
For scanning, you need basically any program that can work with your scanner. Under Windows, the TWAIN scanner driver is popular.5 Under Linux,
many scanners are supported by the VueScan program, but you can use any
other program as long as your scanner is supported.
You can scan using any program like IrfanView, XnView, ACDSee, Photoshop.
(Note that IrfanView is small and free.) It is important that your scanning
program does not try to do anything with the images; in particular, no deskew,
no optimizing, no resizing, nothing at all. You should be able to tell the
program just to save the scans for each page to the hard disk in the TIF
format.
It is convenient if your scanning program can save scanned images for every
page one after another, numbering the files like p0001.tif, p0002.tif, etc. For
example, VueScan and IrfanView can do this.
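Sequential file names also make it easy to check later that no scan is missing. The following is a minimal sketch, assuming file names of the form p0001.tif, p0002.tif as above; the function name and the prefix parameter are made up for this example.

```python
import re


def missing_pages(filenames: list[str], prefix: str = "p") -> list[int]:
    """Return the numbers absent from a sequence like p0001.tif, p0002.tif."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)\.tif$", re.IGNORECASE)
    numbers = sorted(
        int(m.group(1)) for name in filenames if (m := pattern.match(name))
    )
    if not numbers:
        return []
    present = set(numbers)
    # Every number between the first and the last scan should be present.
    return [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]


print(missing_pages(["p0001.tif", "p0002.tif", "p0004.tif"]))  # prints [3]
```

To check a real folder, you could pass the result of `os.listdir()` for that folder as the list of file names.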
5
Most scanners are supported by TWAIN drivers; for other scanners you may need special
drivers.
Here you can choose how to number the scanned files, where to store them,
and in which format to save them. As shown, the files will be named page0001.tif,
page0002.tif, etc. You should select TIFF as the image format. (Do not use
JPEG as the output format!)
Click on Options to the right of the Save as field. This will set the options for the
TIFF format.
You should select LZW compression; this will cut the TIFF file size in two,
compared with no compression (None).6 If you later find that you have compatibility problems with these TIFF files (i.e. you later use a program that
cannot open them) then you need to change the compression method.
Important: Do not use the JPEG compression method for black/white text!
JPEG compression introduces digital artifacts, that is, funny-looking shades
around each letter (see figure 2). It is pointless to use JPEG for black/white
images.7
Now press OK and go to the TWAIN driver window for your scanner.
In the TWAIN window (or other configuration window if you are not using
TWAIN drivers), set the resolution to 300dpi and the color mode to greyscale.
(In some programs, this is called 8 bit greyscale.) These are the most important settings. Some scanning programs do not allow you to explicitly set the
resolution or the color mode; instead they say something like "Black/white
photo" or "web-optimized quality". Avoid these programs; instead use some
program that allows you to set specifically 300dpi and 8-bit greyscale. If you
are not sure that your settings are right, you should try scanning one page,
save the file to disk as TIF, and check the properties of the file in a graphics
editor, to make sure that you actually got 300dpi and 8-bit greyscale.
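If you prefer to check the file with a script rather than in a graphics editor, the following sketch reads a few fields directly from the TIFF header. It is a simplified reader written for this example (the function names are made up); it looks only at the first image in the file and only at the tags needed here. The tag and type numbers are those of the TIFF 6.0 specification.

```python
import struct

# Tag numbers defined in the TIFF 6.0 specification.
TAG_BITS_PER_SAMPLE = 258
TAG_SAMPLES_PER_PIXEL = 277
TAG_X_RESOLUTION = 282
TAG_RESOLUTION_UNIT = 296   # 2 means "inch"


def read_tiff_tags(data: bytes) -> dict:
    """Return {tag: value} for simple tags in the first image directory."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or struct.unpack(order + "H", data[2:4])[0] != 42:
        raise ValueError("not a TIFF file")
    (ifd_offset,) = struct.unpack(order + "I", data[4:8])
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        tag, ftype, n = struct.unpack(order + "HHI", data[start:start + 8])
        field = data[start + 8:start + 12]
        if ftype == 3 and n == 1:      # SHORT, stored inside the entry
            tags[tag] = struct.unpack(order + "H", field[:2])[0]
        elif ftype == 4 and n == 1:    # LONG, stored inside the entry
            tags[tag] = struct.unpack(order + "I", field)[0]
        elif ftype == 5 and n == 1:    # RATIONAL, stored elsewhere in the file
            (offset,) = struct.unpack(order + "I", field)
            num, den = struct.unpack(order + "II", data[offset:offset + 8])
            tags[tag] = num / den
    return tags


def is_300dpi_greyscale(data: bytes) -> bool:
    """True for one 8-bit channel at 300 pixels per inch."""
    tags = read_tiff_tags(data)
    return (tags.get(TAG_BITS_PER_SAMPLE) == 8
            and tags.get(TAG_SAMPLES_PER_PIXEL, 1) == 1
            and tags.get(TAG_RESOLUTION_UNIT, 2) == 2
            and tags.get(TAG_X_RESOLUTION) == 300)
```

To use it, read a scanned file in binary mode (for example `open("page0001.tif", "rb").read()`, where the file name is only an example) and pass the bytes to `is_300dpi_greyscale`. A color scan stores three values for bits per sample, which this simplified reader skips, so such a file is reported as not greyscale.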
6
Note that a typical page scanned in greyscale will occupy between 2 and 4 megabytes on
the hard disk with LZW compression.
7
The JPEG format actually cannot handle black/white images; when one converts
black/white images to JPEG, the software must convert those images into greyscale images.
The JPEG compression then introduces a certain quality loss, as shown in the figure. The
quality loss in JPEG compression is acceptable for photographs but may degrade black/white
text quite significantly, unless a high quality JPEG mode is selected. (The quality of JPEG
compression is usually selectable as a number from 1% to 100%. No visible artifacts would
appear at 90% quality or higher. But some programs, especially for making PDF files or for
optimizing images, may not allow you to set the JPEG quality manually.)
Do a preview scan. Then you can see what has been scanned in the
preview window. If needed, you can turn the page 90 degrees so that
the text is straight up. You can also adjust contrast, brightness, gamma
correction if necessary. Your goal is that the text must be clearly visible,
not too dark, and not too light.
Select the scanning region by using the mouse. You should select the
scanning region such that some white space is left around the text, but
no book crease or off-page regions are scanned. Your purpose is that the
scanning rectangle should fit around the text with some margin, so that
you will not lose any text even if you put the book a little askew on the
glass. And yet you do not want to scan any useless regions outside of
the page.
Press the Scan button with the mouse and wait until the scanner finishes scanning the page. This will get the scan of one page (or two pages
at once, if you can fit the book onto the scanner). The scanned file will
be saved to the disk.
Now that the scanning program is set up, you can scan all the pages
with the same settings. While the scanner lamp is moving back, turn
the next page and put the book back to the same place on the scanner.
Then press the mouse button to scan again. (The mouse can be left
pointing at the Scan button, so you don't need to look. Alternatively,
some scanners have buttons on them that make the next scan.)
This technique allows you to scan the entire book, one page after another,
without looking at the computer screen or at the keyboard. You can watch TV
or whatever while you are scanning. Depending on the scanner speed, you
can get between 100 and 200 scanned pages per hour. Some scanners are
particularly fast (e.g. Plustek OpticBook).
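The speed figures above make it easy to estimate how long a book will take. A minimal sketch (the function name is made up for this example; the default speeds are the 100 and 200 pages per hour quoted above):

```python
def scanning_time_range(page_count: int, slow: int = 100, fast: int = 200):
    """Return (best, worst) scanning time in hours for a book."""
    return page_count / fast, page_count / slow


best, worst = scanning_time_range(450)
print(f"{best:.2f} to {worst:.2f} hours")   # prints 2.25 to 4.50 hours
```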
It is not necessary to set the book onto the scanner absolutely straight (edge
of the book parallel to the edge of the scanner). You should try to put it
reasonably straight, but it is unavoidable that pages will not all be scanned
completely straight; many pages will be slightly skewed. This small skew is
okay and will be corrected later (after scanning) by software. Correcting this
skew is called deskewing. Deskewing is very fast and efficient.
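The geometry behind deskewing is simple. The following sketch computes the skew angle from two points on a line of text and shows that rotating by the opposite angle makes the line horizontal again. It illustrates the idea only; the processing programs detect the angle automatically by their own methods, and the function names here are made up.

```python
import math


def skew_angle_degrees(x1, y1, x2, y2):
    """Angle of the text line through (x1, y1) and (x2, y2), in degrees."""
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def rotate_point(x, y, angle_degrees, cx=0.0, cy=0.0):
    """Rotate the point (x, y) around (cx, cy) by angle_degrees."""
    a = math.radians(angle_degrees)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a))


# A text line that drifts 20 pixels over a width of 1000 pixels.
angle = skew_angle_degrees(0, 0, 1000, 20)
print(round(angle, 2))                 # prints 1.15

# Rotating by the opposite angle brings the end of the line back to y = 0.
x, y = rotate_point(1000, 20, -angle)
print(round(y, 6) == 0)                # prints True
```

A skew of about one degree, as in this example, is typical of a book placed by hand on the glass.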
What you want to avoid when scanning:
Avoid very large skew angles, i.e. do not place a book at a large angle on
the glass. This kind of scan can still be deskewed, but the shapes will
probably not be as smooth as otherwise.
Avoid incomplete page scans, i.e. when some of the text is outside of the
scanning region. This means that some text will be lost (not scanned
at all). If you discover such a page, scan that page again with a correct
scanning region. In a science book, no part of the text is unimportant!
However, avoid scanning the library stamps or other marks on the pages. If
your book has stamps or other markings on some pages, just cover them with
a piece of paper while scanning, or remove them with a digital image editor after
scanning. Nobody wants to see some ugly stamps or marks in the e-book!
Avoid scanning any off-page regions (this will be when your scanning
rectangle is way too large). This will produce a black shadow which,
in many cases, you will have to remove by hand while processing your
scans! (This is so because computers are not very good at guessing what
is a part of the book and what is dirt on a scan.)
Also, avoid producing a fuzzy image because some place on the page was
not close to the scanner glass.
The region of the text around the book crease is often difficult to scan. You can
try scanning one page at a time (rather than two pages) or pressing slightly
harder onto the book binding. It is important that the text is directly next
to the scanner glass. Even 1 mm distance between the glass and the paper
will make a very fuzzy scanned image in almost all scanners! Fuzzy scanned
images are not acceptable. It is very difficult to prepare a good quality final
e-book from fuzzy scans.
Should you scan one page at a time, or two pages at a time? It is faster to
scan a book two pages per scan rather than one page at a time. Double-page scans can be cut quite efficiently and automatically (if they are scanned
cleanly) by software. But not all books can be scanned that way; many books
are too large (you won't fit two pages onto the glass unless you have an A3
scanner, which is usually expensive). Many books don't open sufficiently to
be scanned two pages per scan with good quality (some text near the crease is
lost or becomes too fuzzy, which is not acceptable!). You need to try two pages
at a time, try one page at a time, and then decide how to proceed. Regardless
of how you scan, the processing software will be able to prepare an e-book
with single page images, as long as everything is scanned correctly and the
images are clear.
The result, after scanning the entire book, is a directory full of TIFF files.
These files are the raw material that you will start processing after you finish scanning. Note that you need sufficient disk space to store all those
scans (at least 4MB per scanned image!). After you finish scanning, use a
slideshow mode of some picture viewer to quickly preview the scanned images
to make sure that you didn't miss any pages and that every page is adequately
scanned. It will be too late when you discover, at the final processing stage,
that some pages are only half-scanned or missing, especially when the book
has already left your hands!
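You can estimate the required disk space before you start. The sketch below computes the raw size of one greyscale scan from the scanned area and the resolution, and the total space for a book using the figure of 4MB per scan given above. The function names and the example page size are assumptions made for this illustration.

```python
def uncompressed_scan_bytes(width_inches, height_inches, dpi=300,
                            bytes_per_pixel=1):
    """Raw size of one scan; 8-bit greyscale uses one byte per pixel."""
    width_px = round(width_inches * dpi)
    height_px = round(height_inches * dpi)
    return width_px * height_px * bytes_per_pixel


def disk_space_mb(scan_count, mb_per_scan=4.0):
    """Total space for all scans, using a fixed allowance per scan."""
    return scan_count * mb_per_scan


# A scanning region of 6 x 9 inches at 300dpi greyscale:
print(uncompressed_scan_bytes(6, 9))   # prints 4860000 (1800 x 2700 pixels)

# A book that needs 400 scans:
print(disk_space_mb(400))              # prints 1600.0
```

The same region scanned in 24-bit color (bytes_per_pixel=3) takes three times as much space, which is one more reason to use color mode only for the pages that need it.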
Note: When you scan the book, scan all pages; please do not omit any pages,
including title pages, front matter, any information about the publisher, the table of contents, the index, the bibliography, empty pages in the
middle of the book, page numbers, errata sheets, or anything else!!! You will
not save much time if you decide to skip 20 pages or so while scanning. However, a science book is almost unusable without bibliography and index and
without exact information about its publication.8
In the example shown, a book was scanned with two pages per scan, and
apparently there was some skewing. Our task now is to split, to deskew, and
to cut the page images so that every page has the same size and margins. If
your scan is single-page, you will not need to split, but you will still need to
deskew and cut. This operation is called kromsating in the program.11
8
Also, do not think that you will make your life easier from the legal point of view if you
don't scan the publication information!
9
Please do not write email to Bolega asking for help, for documentation, for source code
of ScanKromsator, or for adding extra features! Instead, just learn to use it and make some
good quality e-books!
10
We will talk only about the bare minimum of ScanKromsator functions here. Unfortunately the ScanKromsator program does not yet have a comprehensive user's manual describing all the functions.
Note that there are now green tick marks in the page list (top left column),
meaning that these pages have been draft kromsated successfully. For each
page you will see the blue lines across the page. These lines are the cutters that determine how the page image will be cut and split. Note that the
program attempts to determine automatically where to cut the margins and
where to split a two-page image into single pages. In some cases the program
may make a mistake and cut too much or too little; in that case you will later
be able to adjust the position of the cutters by hand.
First click the Page tab. Here you can set processing options
for cutting the pages. The option Split means to split the
two-page image into single pages. Deskew will deskew each
single page image separately. Despeckle removes small dots.
Sometimes Deskew makes pages significantly skewed; this
is usually due to some complicated illustrations. In that case,
check Art for these pages. You can set Ortho if the page
needs to be rotated by 90 degrees. You can set these options
separately for left and right (L and R) pages.
Now click on the Book tab. Here you set options related to
the size and layout of the pages in the final book. H.Gap is
the size of the margins. The value of 200 is good for 600dpi
(meaning 1/3 inch). Page width and height can be set to Auto.
You can also center the pages differently (align to center/align
to top/align to bottom).
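The relation between pixels, resolution and physical size is simple arithmetic: a length in pixels divided by the resolution in dpi gives the length in inches. A minimal sketch (the function names are made up for this example):

```python
def pixels_to_inches(pixels, dpi):
    """Physical length of a given number of pixels at a given resolution."""
    return pixels / dpi


def inches_to_pixels(inches, dpi):
    """Number of pixels that cover a physical length at a given resolution."""
    return round(inches * dpi)


print(round(pixels_to_inches(200, 600), 3))   # prints 0.333
print(inches_to_pixels(1 / 3, 300))           # prints 100
```

So the margin of 200 at 600dpi is one third of an inch, and the same physical margin at 300dpi would be 100 pixels.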
We already visited the Files tab at the draft stage. It is very important to
have 600dpi as the output resolution in the Files tab!
Now click on the Options tab. Set Deskew method =
Auto (shear), Resample filter = Lanczos3. The setting Despeckle=Fine+Normal or Safe switches on an intelligent despeckle method that avoids removing the dots over i or j,
for example. Text sensitivity controls the logic of the autocutting. Low sensitivity might cut off the page numbers if they
are too far away from the text. You may need to adjust the
sensitivity settings a little bit; but in most cases they do not
need to be adjusted.
You can skip the Options 2 tab for now. Click on the Convert tab. Here you set the threshold for converting greyscale
images to black/white. Do not forget to hold the Ctrl key (to
set this for all pages) as you select Threshold=MiddleDark.
Experiment with other settings if you don't like the results.
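To see what a threshold does, consider the simplest possible conversion from greyscale to black/white: every pixel darker than a fixed value becomes ink, everything else becomes paper. This sketch only illustrates the principle; it is not a description of what the MiddleDark setting computes, and the function name is made up.

```python
def to_black_white(image, threshold=128):
    """Convert greyscale rows (0 = black, 255 = white) to 1 = ink, 0 = paper."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


# Two rows across the left edge of a letter; 130 and 90 are edge pixels.
page = [[250, 240, 60, 30],
        [245, 130, 90, 20]]

print(to_black_white(page, 100))   # prints [[0, 0, 1, 1], [0, 0, 1, 1]]
print(to_black_white(page, 160))   # prints [[0, 0, 1, 1], [0, 1, 1, 1]]
```

With the higher threshold, one more edge pixel counts as ink, so the letters come out thicker; with a lower threshold they come out thinner.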
Click the Quality tab; there you can further control the conversion to black/white. This is a very important function! Set
Enhance image, Blur=1, and Sharpen=1. What is important
is that the image will become smoother with this setting. The
values of Blur and Sharpen could be 2 instead of 1, although
the value 1 is usually good. A larger value will make the letters more black. You may need to experiment depending on
the quality of printing in a particular book.
Another important option is Gray enhance. Click on it since
you have greyscale scans (which is what you should have!).
Skip several tabs and click the Denoise tab. Set the parameters as
shown at right. These parameters
clean up the image. This is the last
set of options that we are going to
bother with right now.
You can use the File → Options... menu to write the options to a file. This will
save you all this work for the next time.
The last step before the main processing is a visual checking of the position
of the cutters. You need to go through every page and check that the cutters
are correctly positioned. Yes, this is a bit boring... but you can make it quick.
Put two fingers of the left hand onto the keys q and w; pressing these keys
will go to the previous/next page. With the right hand, you hold the mouse
and adjust the position of the cutters wherever needed. Sometimes there is a
skewed shadow, or it is necessary for some reason to set the cutter line at an
angle rather than vertically or horizontally. Hold the Shift key and drag the
cutter by its end to achieve this.
You can copy the cutter position from one page to another. Right-click on the
cutter, and you will see the menu as shown. For instance, if the current
cutter position needs to be applied to all subsequent pages, click Copy current position to → all down.
The program will ask you to confirm that you really are sure you want to
change the resolution of the images. Confirm! The process will then start.
Now you need to wait a while. The upsampling operation can be quite slow;
in recent versions of ScanKromsator (5.8 and up) this operation was made
faster. You may expect to process 5 pages per minute or so. When everything
is finished, you should view the output files in the output folder. You should
check that all pages are cut and deskewed correctly. If some pages are not
processed correctly, you can repeat processing of just those pages with some
other options.
The main processing run may take some hours on a slow computer. It is not
necessary to process the entire book in one run. One can process only some
portion of the pages; then one needs to set Book → Page width → Fixed to the
size determined in the previous portion of the pages (so that all pages have
equal size at the end of processing). It is usually sufficient to take 10 to 15
pages for determining page size.
If you like, you can use the powerful cleaning features of ScanKromsator to
remove the digital dirt from some pages. Typically, the digital dirt is any
extraneous spots on the paper, pencil or pen marks, and library stamps. Of
course, you can also use any graphics editor to clean the images by hand.
Hopefully, there will not be many pages to clean.
You need to set the color of the illustration. For example, if the page contains
a greyscale photograph (rather than a color photograph or color diagram), set
Color=Gray.
We cannot discuss other zone options here; as you see, there are many options
intended for advanced users. But note that after kromsating, the picture
zones will be saved to separate files. So after the main processing run you
will have to merge them with the page files. This is done by using the menu
command: Zones → Picture zone → Merge zones. The resulting page files will be
TIFF files in which the text is black/white but the picture zones have color.
ScanTailor has online documentation at its website; you can read about many
features of ScanTailor there. Therefore, here I will only show how to do the
most common processing steps.
Figure 6: ScanTailor's main window with some scans loaded into a project
Unnamed.
This will start the automatic (batch) processing of steps 1-5 for all pages with
the default options. This process will take maybe 20 minutes or so (maybe
about 5 seconds per page), but at least you don't have to do anything while
the program is working. This is your draft run. While it is running, let me
try to explain what is actually happening now.
double-page scans that are already correctly oriented. Most likely, ScanTailor
will automatically and correctly split them into single-page scans. In some
rare cases the splitting is done incorrectly (e.g. too much text is cut off). In
this case you can go back to the split pages step and correct this by hand. If
your scans never need to be split, you can disable splitting (set it to manual
for all pages).
The third step is deskew, that is, a small rotation of each page to make the
orientation completely upright. Note that deskewing is applied separately to
every page, also to every split page. In most cases ScanTailor will correctly
make the orientation of the text as horizontal as possible. In very rare cases
you will have to adjust the deskewing by hand.
The fourth step is select content. It selects the rectangle that seems to contain all the text on the page. In quite a few pages this rectangle will be too
small or too big! (This is because it is difficult for the computer to understand
automatically what the actual text is and what is some artifact of scanning,
like a dark shadow at the edge of the page.) So it is at this step that you certainly will have to look at every page and check that the rectangle is selected
correctly. More about this below.
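To see why automatic content selection can go wrong, consider the most naive approach: take the smallest rectangle that contains every dark pixel. This sketch is not ScanTailor's algorithm, only an illustration with a made-up function name; it shows how a single dark shadow pixel at the edge of the page enlarges the rectangle.

```python
def content_box(image, threshold=128):
    """Smallest rectangle (left, top, right, bottom) around all dark pixels.

    Coordinates are inclusive; returns None for a blank page.
    """
    rows = [y for y, row in enumerate(image) if any(p < threshold for p in row)]
    cols = [x for row in image for x, p in enumerate(row) if p < threshold]
    if not rows:
        return None
    return (min(cols), rows[0], max(cols), rows[-1])


W = 255  # white paper
page = [[W, W, W, W, W, W],
        [W, W, 0, 0, W, W],
        [W, W, 0, W, W, W],
        [W, W, W, W, W, W]]
print(content_box(page))    # prints (2, 1, 3, 2)

page[3][5] = 40             # a dark shadow in the corner of the scan
print(content_box(page))    # prints (2, 1, 5, 3)
```

Real programs use more careful rules, but the computer still cannot always tell text from scanning artifacts, which is why you need to check the rectangle on every page.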
The fifth step is page layout. This step is fully manually controlled by the
user; each page's content rectangle is aligned (if desired) with the content
rectangles of all other pages, margins are added, and the resulting rectangle
is prepared.
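The idea of the page layout step can be sketched as follows: choose one page size that fits the largest content rectangle plus margins, and then place each page's content inside it. This is only an illustration of the idea under simple assumptions (equal margins on all sides, horizontal centering); it is not ScanTailor's actual algorithm, and the function names are made up.

```python
def common_page_size(content_sizes, margin):
    """Page size (width, height) that fits every content rectangle plus margins."""
    width = max(w for w, h in content_sizes) + 2 * margin
    height = max(h for w, h in content_sizes) + 2 * margin
    return width, height


def place_content(page_size, content_size, margin, align="top"):
    """Top-left corner of the content inside the page, centered horizontally."""
    page_w, page_h = page_size
    w, h = content_size
    x = (page_w - w) // 2
    if align == "top":
        y = margin
    elif align == "center":
        y = (page_h - h) // 2
    elif align == "bottom":
        y = page_h - margin - h
    else:
        raise ValueError("align must be 'top', 'center' or 'bottom'")
    return x, y


# Content rectangles in pixels at 600dpi; the last page ends a chapter.
sizes = [(3000, 4800), (2950, 4700), (3000, 1500)]
page = common_page_size(sizes, margin=200)
print(page)                                            # prints (3400, 5200)
print(place_content(page, (3000, 1500), 200, "top"))   # prints (200, 200)
```

A page with little text, such as the last page of a chapter, is the typical case where the alignment matters: aligned to the top it looks like the printed page, while centered it would float in the middle.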
Since it is only at step 4 that problems are quite likely to appear while step
5 is completely manual, I propose to run all the steps 1-5 automatically as
the draft run. After the draft run, you will have to return to step 4 and flip
manually through all pages to check that all is well. If needed, you will be
able to return to any previous step for every page where that step produced
an incorrect result. As experience shows, a non-negligible amount of work is
needed only for step 4 at this point.
The last step is output. At this step, which is usually quite slow but does
not require any attention from you, ScanTailor will produce the resulting TIFF
files in the output directory. After this step, you should flip through the
final page images again, and check that everything is okay (especially if there
were any color illustrations, see below). If there are no color illustrations, the
output is usually fine without any further manual work.
It is important to understand that your original scanned TIFFs will never be
changed; ScanTailor will only produce some new TIFFs in a different directory,
and this will be done only at the last step (the output). These TIFFs will be
the result of the ScanTailor processing.
file, also at any time without stopping the automatic run. This will save the
information gathered up to that point. (What if the power is cut to your computer? Then you will be able to continue right from the point where you last
saved the project file.)
When the draft run is completed, ScanTailor will stop and return to the first
page (figure 8).
Now you need to click on step 4 select content.
You will see an image of the first page with a rectangle around the text; this
is the rectangle that ScanTailor automatically selected according to its algorithms (figure 9). You will be able to see right away whether ScanTailor was
correct. Maybe on some pages text will be visibly cut off, or not included in
the rectangle. In order to correct all this, you will now flip through all the
pages in your project and correct all such possible errors. You will also be
able to immediately see and correct problems created at any previous steps
(1-4), such as incorrect splitting of double pages.
In the page shown in figure 9, everything is okay, so you go to the next page.
To flip to the next page, press PageDown or W on the keyboard. To go to
the previous page, press PageUp or Q on the keyboard. (Or you can use
the mouse wheel in the right column with thumbnails and then click on the
thumbnails.)
Note the long horizontal button over the thumbnails; this is the scroll lock
button. If this button is pressed, the thumbnail column will always show the
page you are currently working on. Otherwise you can scroll away from your
currently active page, to look at some other thumbnails.
As you go through the pages or switch between different steps, you may have
to wait a little bit as the display updates. Eventually, as you go through all
the pages, you will probably find a page where there is some problem after
the draft run. There are five main types of problems to be corrected; most
frequently:
1. the content rectangle needs adjusting (some text is outside, or the rectangle is too big and includes some noise)
2. the page alignment needs adjusting (usually at the beginning or at the
end of a chapter, when most of the text is at the top of page or at the
bottom of page)
3. incorrect splitting (this may happen when the page contains complicated
tables and so was split when it shouldn't have been)
4. incorrect deskewing (usually this happens when the page contains no
text but only some large shapeless illustration)
5. the scan was done incorrectly (e.g. the page was not completely scanned)
Let us see how these problems can be corrected.
Figure 13: The left part of the page is missing and cannot be included in the
content rectangle at all.
The problem is that some part of the text was cut away at the splitting step!
Click on split pages and you will see something like figure 14.
Figure 14: The splitting step shows the line of splitting. It was obviously
incorrect.
Clearly, you need to drag the line of splitting to the left. After dragging that
line, click again on select content. Now you will see a better content rectangle; still it needs to be adjusted a little, until you see something like figure 15.
and add the new TIFF file to the project. Right-click on the thumbnail of some
page; you will see a menu Insert before, Insert after, Remove. This allows
you to remove incorrect scans and insert new, corrected scanned pages into
the project (although this is done one page at a time, so if you want to add
a lot of pages, it is better to start a new project).
Notes about removing or adding scans:
When you remove pages from the project, the scans are not actually
removed from the disk. Also, you can remove only one page from a split
double-page scan, if necessary.
It is advisable not to remove any empty pages in the middle of the book,
because removing these pages will break the page numbering.
Empty pages will take practically no space in the final file. However, it is
better to remove empty pages at the very beginning and at the very end
of the book.
When you add pages to the project, the new pages will have no processing
steps already run on them, while other pages might be already partially
processed. So, for instance, the new pages will appear to have the default
page layout settings, and you will have to run all the steps on them,
including select content and page layout.
When a page has been removed from the project, and this page was part of
a two-page sheet, the settings for the other half of the sheet will be lost!
You will have to click on that page and run again the content selection
and page layout steps.
Note that you can apply the brightness settings to all pages at once, or only
to selected pages (select them with the mouse in the thumbnail column at
right), or only to pages after the current one. It is important to remember that
the settings you choose in the output window (or anywhere else in ScanTailor)
apply only to the current page unless you press Apply and select Apply to all
pages or another option.
You are basically almost done at this point! Click again on the first page
thumbnail, so that you see the first page, and then click on the play button
to the right of output. This will start the automatic processing of all pages.
This operation is the final run, which may take an hour or more (maybe
about 15 seconds per page).
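To get a feeling for how long the final run will take, you can multiply the number of pages by the time per page. The following is only a rough sketch: the function name is made up for this illustration, and the figure of 15 seconds per page is the estimate quoted above, which will vary with your computer and your scans.

```python
def estimate_run_minutes(page_count, seconds_per_page=15.0):
    """Return a rough duration of the final output run, in minutes."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * seconds_per_page / 60.0


# A 300-page book at about 15 seconds per page.
print(estimate_run_minutes(300))  # 75.0
```

So a 300-page book needs a little over an hour, which agrees with the estimate above.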
After this operation is done, you can do a final check-up of the pages. If the
images for some pages are somehow still not correct, you can go back to any
step and re-do it.
If your pages are all black/white, the only possible problems are these:
Final image is too thin/too thick on some pages where the brightness
was for some reason different from that of all other pages.
Despeckling has removed some dots that are actually part of the text.
You can flip through the pages while viewing the despeckling results: click
on the despeckling tab in the output window. The red dots will show where
ScanTailor removed dots from the image. If you see that ScanTailor removed
dots that are not dirt but actually are points in the text, such as . . .
somewhere, you should use a different despeckling broom or disable despeckling altogether (or make the image thicker). Usually, ScanTailor will be
careful with despeckling, but there are some cases when despeckling needs
to be disabled for some or all pages.
Note: it is advisable to save your project often while you are working on it.
ScanTailor is a stable program, but Windows is not, so if your computer
crashes for any reason, you will be able to continue right where you last
saved.
When you are done, the final images are in the output directory as a bunch of
TIFF files. These files will be in 600dpi and black/white, so they will be much
smaller than your original greyscale scans. This concludes the processing of
scans; the next step would be converting these scans to DJVU, see section 5.
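If you are curious how much smaller the processed images are, you can compare the total size of the two directories. The sketch below is not part of ScanTailor; the function names are invented for this example, and it only assumes that both directories contain files ending in .tif or .tiff.

```python
from pathlib import Path


def total_tiff_bytes(directory):
    """Sum the sizes of all .tif/.tiff files directly inside `directory`."""
    return sum(
        p.stat().st_size
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in (".tif", ".tiff")
    )


def size_ratio(scan_dir, output_dir):
    """Return the output size divided by the original scan size."""
    original = total_tiff_bytes(scan_dir)
    if original == 0:
        raise ValueError("no TIFF files found in " + str(scan_dir))
    return total_tiff_bytes(output_dir) / original
```

Call size_ratio with the directory of your original scans and the output directory of your project; a result of 0.1 would mean that the processed images take one tenth of the space of the scans.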
not particularly useful; making them all black/white will not significantly decrease the usefulness of the book, but it will significantly decrease the amount
of work you will have to expend on the file, and the final file size will be maybe
half the size.
If there are some pages with important illustrations, you need to navigate
to these pages and click on output. You will get to these pages if you flip
through all the final images after the output run. Do not wait until the final
image is produced; immediately click on Mixed in the Mode box.
In the mixed mode, ScanTailor will try to detect automatically where the
greyscale or color illustration is located on the page. As an example, see
figure 17.
Figure 17: In the mixed mode, the illustration below is automatically detected as the picture zone and is shown to you in changing color when you
click on the Picture Zones tab. Note that the upper illustration is purely
black/white and was not selected as a picture zone.
You can also adjust the brightness of the final image in the mixed mode.
Sometimes ScanTailor guesses the picture zones somewhat incorrectly. Then
you can draw your own picture zones with the mouse.
A few words about editing the picture zones. You can add new picture zones
with boundaries made of straight lines. You cannot delete the automatically
found picture zone. But you can subtract a picture zone from the zones
already present. To do that, right-click on some point inside the picture zone
and select properties. Then you can select subtract from all layers or
subtract from the auto-layer.
If the automatically selected picture zone is very irregularly shaped, and if
this is not right, perhaps the easiest thing to do is to draw a big picture zone
around the automatically selected zone and select subtract from auto-layer,
so that the automatic picture zone is effectively removed, and then to draw
your own picture zones and select add to auto-layer. What you add to auto-layer will take precedence over what you subtract from auto-layer. If you
click subtract from all layers, this is the highest layer and will subtract
also from your added layers.
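The order of precedence described above can be pictured with sets of pixels. This is only a model of the rules as stated in this tutorial, not ScanTailor's actual code; the function name and the tiny one-row "image" are invented for the illustration.

```python
def final_picture_zone(auto, subtract_auto, add_auto, subtract_all):
    """Combine zones given as sets of (x, y) pixels, lowest layer first."""
    zone = set(auto) - set(subtract_auto)  # lowest: subtract from the auto-layer
    zone |= set(add_auto)                  # added zones override that subtraction
    zone -= set(subtract_all)              # highest: subtract from all layers
    return zone


# The automatic zone is wiped out by one big "subtract from auto-layer" zone,
# then two pixels are added back, and one of them is removed again by
# "subtract from all layers".
auto = {(0, 0), (1, 0), (2, 0)}
big_subtract = {(0, 0), (1, 0), (2, 0), (3, 0)}
added = {(1, 0), (2, 0)}
subtract_everywhere = {(2, 0)}
print(final_picture_zone(auto, big_subtract, added, subtract_everywhere))  # {(1, 0)}
```

Only the pixel that was added and not subtracted from all layers remains in the picture zone.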
The other possibility is not to tinker with picture zones but encode everything
as color. (The color mode.) In that mode, it is advisable to check the boxes
white margins and adjust luminosity. If you use this mode, the entire
image will be saved as a picture zone. This will result in larger files, but
is entirely acceptable and perhaps necessary if you have very complicated
graphics that are not greyscale. Experiment and see what works best for your
scans.
In any case, you can immediately see what the output will be for each given
page. You will have to experiment until you find the right options. You can
then apply these options at once to a group of pages or to all pages, by selecting the pages in the thumbnail column and pressing Apply To and then To
selected pages.
There is also a free software package called djvulibre, but it cannot produce sufficiently
well compressed DJVU files.
This is a rather large package; there exists a stripped-down version that takes only about
20MB on the hard disk.
Now choose the Text tab as shown above. In that tab, set Pages per dictionary = 1000 (if this consumes too much RAM on your computer, or if this is
too slow, set to 200 or 300 instead of 1000). Save the custom profile under
a new name, say Bitonal-1. Do the same for the Scanned (600dpi) profile if
you need to encode books with color drawings.
Now run the Document Express Workflow Manager. Load all the TIFF pages
into it. In the Job name field, write the name of the book if you want. Choose
the previously created custom profile in the list Raster profile.
Then click on the Output tab (the tabs are at the bottom of the window). In
the list Separate document(s) choose One document only. Tick the box
under Enable at far left. Wait until the encoding is finished. You can also
look at the Log tab to watch the progress. That's all; the DJVU file is created.
Do not delete the TIFF files yet! You may need to encode again if the DJVU
file has some error. Also, the TIFF files are useful for OCR purposes (see
section 6).
The result of DJVU encoding is a multipage DJVU file containing the entire e-book. You should rename that file to something sensible; not just math1.djvu.
At the very least, the file name should contain the author's name, the title of
the book, the publication year, and/or the ISBN number if available. This is
just a little work, but it will be so much easier to share that file on the Internet
if its name is sensibly chosen.
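If you process many books, a small helper can build such file names for you. The following is a sketch under assumptions: the pattern "Author - Title - Year - ISBN" is just one possible convention, the function name is invented, and the author and title in the example are made up. Characters that Windows does not allow in file names are replaced by underscores.

```python
import re


def make_djvu_name(author, title, year=None, isbn=None):
    """Build a descriptive DJVU file name from the book's details."""
    parts = [author, title]
    if year is not None:
        parts.append(str(year))
    if isbn:
        parts.append("ISBN " + isbn)
    name = " - ".join(p.strip() for p in parts)
    # Replace characters that are not allowed in Windows file names.
    name = re.sub(r'[\\/:*?"<>|]', "_", name)
    # Collapse runs of whitespace into single spaces.
    name = re.sub(r"\s+", " ", name).strip()
    return name + ".djvu"


# Hypothetical book, for illustration only.
print(make_djvu_name("Smith J.", "Calculus: An Introduction", 1987))
# Smith J. - Calculus_ An Introduction - 1987.djvu
```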
FineReader 9 is now available but it cannot add OCR to DJVU files, and there is no
DjvuOCR support for FR 9.
This program has several functions; for example, DjVu Decoder will produce
TIFF files out of DJVU in case you deleted your TIFF files, or if you are working
with somebody else's DJVU file. For now, you will use only the Manual mode
OCR manager. Click that, and you get the following window.
Select the directory where the FineReader batch is located in the FineReader
Project directory field. Output OCR text file will be the name of the new file;
it doesn't matter what that name is. Tick the Burn DJVU file box and select
the DJVU file below; it means that the OCR data will be inserted (burned)
into the DJVU file. Click Process, wait a few minutes, and that's all. Now
the DJVU file is full-text searchable!
The second way to add hyperlinks is semi-automatic, using the program DJVU
Hyperlinks Editor.15 Run the program and you will see the following window.
First you need to specify options for the hyperlinks. Then you need to specify
the page range in which the table of contents is located in the
DJVU file. These are DJVU page numbers, which may be different from the
page numbers printed in the book and in the table of contents (e.g. because
there are some pages taken by the cover and by the front matter). To compensate for this, usually one needs to add a certain offset to the page number; for
instance, page 10 in the printed book may be actually page 11 in the DJVU
file because one page is taken by the cover.16 Then you need to enter the
corresponding offset into the box.
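The offset arithmetic is simple, but it is easy to get the sign wrong. Here is a minimal sketch; the function names are invented for this illustration and are not part of DJVU Hyperlinks Editor.

```python
def offset_from_example(printed_page, djvu_page_number):
    """Work out the offset from one page whose two numbers you know."""
    return djvu_page_number - printed_page


def djvu_page(printed_page, offset):
    """Convert a page number printed in the book to a DJVU page number."""
    return printed_page + offset


# Printed page 10 is page 11 of the DJVU file, so the offset is 1.
offset = offset_from_example(10, 11)
print(offset)                 # 1
print(djvu_page(25, offset))  # 26
```

To find the offset for your own file, open any page in the DJVU viewer, note its DJVU page number, and compare it with the number printed on that page.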
Similarly, one can create hyperlinks in the subject index. One needs to select
a different entry in the drop box. The default entry
as shown means Table of contents. Other entries mean that you want to
process the subject index. The same settings apply.
After finishing the processing, one should view the DJVU file and check that
the hyperlinks were added correctly. The program relies on the OCR text for
determining the page numbers for hyperlinks. So any errors in OCR may lead
to errors in the position or targeting of the hyperlinks.
Download site             Status
www.irfanview.com         free
scantailor.sf.net         free
www.djvu-soft.narod.ru    free
www.djvu-soft.narod.ru    free
www.djvu-soft.narod.ru    nonfree
www.abbyy.com             trial
djvuocr.ucoz.ru           free
www.djvu-soft.narod.ru    free
16
This is the Russian convention where the page numbering starts right away from the first
page of the book. In the Western typography the front matter usually has separate roman
numbering, so typical offsets will be not 1 but between 10 and 20.
Figure 7: After clicking on page layout while on the first page. The big
question marks on the thumbnails mean that these pages have not yet had
this step (page layout) performed on them. The Alignment symbols mean
the centering or flush-centering of the page in various directions. Press on
them to see immediately what effect these options would have on the final
appearance of the page.
Figure 8: After the draft run you are again at the first page. The big question
marks on the thumbnails are gone.
Figure 9: After you click on select content you can inspect the content rectangle. In most cases (like on this page), the content is detected perfectly.
Figure 10: In this case the content rectangle is too small. You need to adjust
it by dragging with the mouse.
Figure 11: The content rectangle is correct but very small (see on left). The
default page alignment will flush this rectangle to the top of the page and
center it, which is not what is desired.
Figure 12: An enlarged content rectangle (left) produces good page layout
(right) like in the original printed page.