Document recognition
These strokes consist of connected line segments. These line segments are then decomposed into X, Y components and distorted by a random number generator. Hand-printed characters can also be recognized by strokes 8. Another approach uses direction changes in curves to recognize numerals 9. These changes are structured and recorded. These recorded data are then used to classify numerals.

Usually both machine-printed and hand-printed characters are mixed in documents. Before recognizing characters, machine-printed and hand-printed regions on a document have to be separated, so that a specific tool can be applied to recognize the characters in a particular region. An algorithm which can perform such tasks is proposed in Reference 10. Documents are first segmented into regions which can be either machine-printed or hand-printed. After the regions in the document have been segmented, machine-printed and hand-printed regions are then distinguished by the regularities within each one of them.

After regions have been extracted and distinguished as machine-printed or hand-printed, the characters in each region have to be separated so that character-recognition techniques can be applied to each of these regions. The character-segmentation problem for machine-printed characters can be formulated and classified as a pitch-estimation problem and a character-sectioning decision problem. Several pitch-based character-segmentation methods have been developed in order to process incorrect characters which cannot be recognized by OCRs. These methods are basically applied to constant-pitch characters. Most of them are based on standard character widths and topological features detected from the character geometries. One by one, the beginning and end positions of characters are determined by these different bits of information 11,12. In OCR-based systems, it is necessary to identify variable- and constant-pitch characters, and segmentation of these characters is an important factor for successful recognition. In Reference 13, a least-square-error function is introduced to estimate variable-pitch text. The segmentation problem is more difficult for hand-printed characters. A knowledge-based system is proposed in Reference 14 to tackle this problem. General knowledge about characters and textlines, and specific knowledge about the documents the system deals with, constitute the knowledge base. The approach proposed in Reference 15 uses tendency and tense as features to segment characters. The tendency consists of vertical, right slant, horizontal, and left slant. The tense is used to analyse the relationships between the past and the present continuation to determine if the present contour is a boundary of characters. These two features are used to segment hand-printed characters in documents.

Document-understanding

Document-understanding is a special technique which makes machines understand the arrangements and the contents of documents. It involves all the techniques mentioned above. After documents are understood (i.e., well analysed) by machines, they can be stored or transmitted in a more efficient format than raw images. A system is proposed in Reference 2 for understanding documents. The system segments a document into smaller regions with only one type of data (such as text, graphics or halftone images) in each region. Regions with different characteristics are processed in different ways. Relationships between regions are also recorded. After documents are analysed, they are stored in the encoded form. MACSYM 16 is a knowledge-based, event-driven, parallel system for document-understanding developed in Japan. EXPRESS 18 is a system developed under the philosophy of MACSYM. After documents are understood, further processing, such as confirmation of the page order or identification of the document type, can be performed.

OVERVIEW OF THE SYSTEM

The proposed system consists of four major components: digitizer, preprocessor, feature-extractor, and line-pattern classifier. The digitizer converts input documents to digital bit-map images. The preprocessor scales the digitized image to a suitable level of resolution, and eliminates artifacts produced by the digitizer. The feature-extractor locates line segments in the scaled image, and adjusts them to minimize skew and shift effects. The line-pattern classifier takes the adjusted line segments as input, and checks the model database to decide if the unknown document can be classified as one of the prestored model classes. The block diagram of the proposed system is shown in Figure 1. In the phase of model-base construction, the training samples, which are the prototypes (usually blank forms) of different documents, are fed through the scanner, the preprocessor, and the feature-extractor step by step. The final result for each different document, in the
form of a line segment set, is stored in a database. In the classification phase, an unknown document is first processed by the three modules identical to those in the training phase. The resultant line segment set is then fed into the line-pattern classifier to have its class membership determined.

Figure 1 Block diagram of the proposed system

THE SCANNER

The scanner converts documents to digital images. In the proposed system, a scanner developed by the Bell & Howell Company is used for this purpose. The resolution of the scanner is 200 DPI (dots per inch). For A4-sized documents, the resolution of digitized images is 1664 × 1892. (Actually, the size of A4 paper is 8.5 × 11.0 inches, so the resolution should be 1700 × 2200. But the scanner reduces the size of the document to about 97 percent horizontally, and not the whole document is digitized vertically. This can be seen from Figure 2.) The digitized image is then transmitted to the expansion memory of an IBM-PC/AT via its DMA channel for processing. Eight pixels are grouped into a byte for fast transmission and efficient storage. This image is then taken from the memory, and stored on the hard disk. A typical digitized image is shown in Figure 2.
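The byte-packing step mentioned above can be illustrated with a short sketch. This is not the system's actual code (the paper does not give any); it simply shows one conventional way to pack eight binary pixels into one byte, most significant bit first, and to unpack them again.

```python
def pack_row(pixels):
    """Pack a row of binary pixels (0 = white, 1 = black) into bytes,
    eight pixels per byte, most significant bit first."""
    out = bytearray()
    for i in range(0, len(pixels), 8):
        byte = 0
        for bit, p in enumerate(pixels[i:i + 8]):
            if p:
                byte |= 1 << (7 - bit)
        out.append(byte)
    return bytes(out)

def unpack_row(data, width):
    """Recover the first `width` pixels from packed bytes."""
    return [(data[i // 8] >> (7 - i % 8)) & 1 for i in range(width)]
```

With this layout a 1664-pixel scan line occupies 208 bytes, which matches the factor-of-eight saving the text describes.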
THE PREPROCESSING MODULE

The preprocessing module scales the digitized image to a suitable level of resolution, and eliminates the artifacts produced by the digitizer.

Scaling

The digitizer is a 200 DPI scanner. If the document is printed by a 300 DPI or 400 DPI printer, 1-pixel-wide horizontal lines will not be digitized perfectly, i.e., they are likely to be broken. If they are broken, the preprocessing module tries to reconnect them. As is shown in Figure 3, these lines are broken into several line segments isolated by white pixels. These white pixels are isolated horizontally, even though they are connected vertically. The way to bridge these broken segments is to find these horizontally isolated white pixels and replace them with black ones. It is assumed that these are artifacts produced by the scanner, because it is impractical to print such things on a document to make part of the line segments invisible.

Figure 3 An example showing some broken horizontal lines

To speed up the subsequent processing, nonoverlapping 8 × 8 windows are used to 'shrink' each 8 × 8 pixel set to one pixel. In order to preserve horizontal lines, and in the meantime to eliminate characters, the scaling process is done as follows: in each window, the number of black pixels in each row is counted, and the maximum of these counts is recorded. It is then compared with a threshold. If it is larger than the threshold, the corresponding pixel in the scaled image is set to black; otherwise, it is set to white. The threshold is …

Figure 4 An example showing how a horizontal line passes through a window
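The scaling rule just described can be sketched as follows. The 8 × 8 window size comes from the text; the threshold value and the list-of-lists image representation are illustrative assumptions, not values from the original system.

```python
def scale_image(img, win=8, threshold=5):
    """Shrink a binary image (rows of 0/1, 1 = black) by mapping each
    nonoverlapping win x win window to one output pixel.  In each
    window the black pixels in every row are counted and the maximum
    row count is compared with a threshold.  A horizontal line fills a
    whole window row (count = win) and survives; isolated character
    strokes rarely exceed the threshold and are dropped."""
    h = len(img) // win
    w = len(img[0]) // win
    scaled = [[0] * w for _ in range(h)]
    for by in range(h):
        for bx in range(w):
            max_row = max(
                sum(img[by * win + r][bx * win:(bx + 1) * win])
                for r in range(win)
            )
            scaled[by][bx] = 1 if max_row > threshold else 0
    return scaled
```

A full-width horizontal line therefore maps to a run of black pixels in the scaled image, while a small blob of character pixels maps to white.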
Figure 5 A typical scaled image

… one found. Of course, the first black pixel is always on the first row. Secondly, it traces each column of the image from bottom to top and eliminates any black pixel connected to the first one found. The coordinates of the last eliminated pixel in each column are recorded. The maximum Y coordinate among these recorded pixels indicates the size of the document. This situation is shown in Figure 6.

Figure 6 Document size determination

FEATURE EXTRACTOR

The feature extractor locates and restores horizontal lines in the scaled image. It consists of two parts: the line-tracing module and the line-adjustment module.
Since documents are seldom fed into the scanner in an upright manner, horizontal lines are likely to be broken into several connected segments with different Y coordinates (as seen in Figures 2 and 5). The line-tracing module traces the scaled image and finds all straight lines. It starts from the upper-left corner of the scaled image and searches each column from top to bottom, looking for black pixels. If it finds one, it searches toward the right of the image to find another black pixel. There are three possible trace directions (right, upper right, and lower right), as shown in Figure 7. The tracing module always traces to the right if possible. If it fails to do so, it tries the other two directions. At each movement, a counter which counts the number of traced pixels is incremented by one. It stops when no further movement can be made. The value of the counter is then checked to see if it is larger than a threshold. This threshold should be determined by (1) the average minimum line length in all the documents, (2) the resolution of the digitizer, and (3) the resolution chosen for processing. If the value is larger than the threshold, the start and the end coordinates are recorded. If the traced line is long enough, pixels on it are not allowed to be traced again. The tracing module terminates after all black pixels have been traced and all straight lines extracted.

Three problems, however, remain:

1. Thicker lines are likely to occupy more than one row in the scaled image.
2. Some characters are connected to line segments.
3. The document is skewed when it is scanned.

These problems are solved by the method to be described in the following subsections.

Merging lines

This process checks the spatial relationships of any two lines found in the scaled image, and merges them if certain conditions are satisfied. There are three possible operations on a pair of lines:

1. No operation on either line.
2. Discard the shorter line which is connected with the longer one.
3. Merge the two lines to form a longer one.

These operations are executed according to three conditions to be described. Let L_i and L_j be the two lines being checked, where i ≠ j, i, j ≤ n, and n is the number of lines in the scaled image. Let Lsx, Lsy, Lex, and Ley be the start and end point coordinates of a line, respectively.

The three conditions are as follows:

1. If |Lsy_i - Lsy_j| > 2 and |Ley_i - Ley_j| > 2, no action is taken.
2. If l_i ≤ l_j, Lsx_i ≥ Lsx_j and Lex_i ≤ Lex_j, where l_i and l_j are the lengths of lines i and j, respectively, then L_i is merged with L_j, i.e., L_i is removed from the line segment set.
3. If Lsx_j < Lex_i and Lex_j > Lex_i, then let Lex_i = Lex_j and Ley_i = Ley_j, and remove L_j from the line segment set.

Note that conditions 2 and 3 are checked only if condition 1 is not satisfied. These conditions and operations are illustrated in Figure 8.
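A minimal sketch of these merging rules is given below, assuming each line is stored as an (sx, sy, ex, ey) tuple with sx ≤ ex. The iteration strategy and the exact overlap test in rule 3 are reconstructions, not the paper's code.

```python
def merge_lines(lines):
    """Repeatedly apply the three merging rules to a list of lines
    (sx, sy, ex, ey) until no rule fires.
    Rule 1: lines whose start and end rows both differ by more than
            2 pixels are left alone.
    Rule 2: a line whose horizontal span lies inside a longer line's
            span is discarded.
    Rule 3: partially overlapping lines are merged into a longer one
            by extending the first line's end point."""
    lines = list(lines)
    changed = True
    while changed:
        changed = False
        for i in range(len(lines)):
            for j in range(len(lines)):
                if i == j:
                    continue
                sxi, syi, exi, eyi = lines[i]
                sxj, syj, exj, eyj = lines[j]
                if abs(syi - syj) > 2 and abs(eyi - eyj) > 2:
                    continue                          # rule 1: no action
                if sxi >= sxj and exi <= exj:
                    del lines[i]                      # rule 2: i inside j
                    changed = True
                    break
                if sxi < sxj <= exi < exj:
                    lines[i] = (sxi, syi, exj, eyj)   # rule 3: extend i
                    del lines[j]
                    changed = True
                    break
            if changed:
                break
    return lines
```

Running this on a short line contained in a longer one, plus a partially overlapping continuation, leaves a single merged line.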
Horizontal line restoration and position normalization

After the merging process, only one-pixel-wide lines are left in the image. These lines are usually skewed (see Figures 2 and 5). The final step of feature extraction is to rotate the entire image according to the skew angle estimated from these lines, and then to normalize its position by shifting the minimum bounding rectangle of these lines to the upper-left corner.

Skew angle estimation for horizontal line restoration. The skew angle estimation is based on the lines in the scaled image. These are assumed to be horizontal lines in the original document. The slopes of all the lines are computed first. They will be zero if the document is not skewed, but if the document is skewed, the slopes of longer lines will not be zero. The skew angle of the document is estimated as follows:

M_i = (Ley_i - Lsy_i) / (Lex_i - Lsx_i),   (1)

where M_i is the slope of the ith line, i ≤ n, where n is the number of lines in the scaled image, and Lsx_i, Lsy_i, Lex_i, and Ley_i are the x and y coordinates of both ends of the line, respectively. A histogram of the slopes is constructed for estimating the skew angle. The range of angles is partitioned into a finite number of intervals. The interval that has the largest count is considered as the range the skew angle is most likely to be in. The average slope in that interval is taken as the estimated skew angle.
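The histogram-based estimation can be sketched as below. The number of intervals and the +/-15 degree search range are illustrative assumptions; the paper does not specify these values.

```python
import math

def estimate_skew(lines, bins=90, max_angle=15.0):
    """Estimate the document skew from line segments (sx, sy, ex, ey):
    compute each line's slope as in Equation (1), convert it to an
    angle, histogram the angles over [-max_angle, max_angle) degrees,
    pick the most populated interval, and return the average angle in
    that interval."""
    angles = []
    for sx, sy, ex, ey in lines:
        if ex != sx:
            slope = (ey - sy) / (ex - sx)
            angles.append(math.degrees(math.atan(slope)))
    width = 2 * max_angle / bins
    hist = [[] for _ in range(bins)]
    for a in angles:
        if -max_angle <= a < max_angle:
            hist[int((a + max_angle) // width)].append(a)
    best = max(hist, key=len)
    return sum(best) / len(best) if best else 0.0
```

Because only the fullest interval is averaged, a few spurious lines with outlying slopes (e.g. traced through characters) do not disturb the estimate.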
… performed for every model class. The threshold used to determine whether a matched pair is found depends on the resolution of the scaled image and the maximum tolerance that the scanner can bear. The differences of all the matched pairs in each model class are then averaged by the number of matched pairs in that model class. This averaged value is defined to be the distance between the unknown document and the model class. A measurement of how likely the unknown document is to belong to a model class is derived from these distances. It is computed by:

(A - D_i) / A,   (2)

where A is the sum of all the distances, and D_i is the distance between the unknown document and the model class M_i.

The model class that has the smallest distance will have the largest value computed from Equation (2). In most cases, the 'right' model class will have the largest value, but this is not always the case. When a document with only a few lines is compared with a model with lots of lines (or vice versa), it is possible that all the lines in the unknown document (or the model, according to which one has fewer lines) will be matched, and the computed distance will be very small (see Figure 9). In such cases, we cannot classify the document successfully. Another measurement is then introduced to solve this problem: the number of matched pairs. A factor, referred to as the correlation factor, is derived from this number and computed by:

(Nmatched_i / Nm_i) × (Nmatched_i / Nun),   (3)

where Nmatched_i is the number of matched pairs in model class M_i, Nm_i is the number of lines in model class M_i, and Nun is the number of lines in the unknown document. Note that neither Nmatched_i / Nm_i nor Nmatched_i / Nun is allowed to exceed 1.0. If this happens, the ratio is set to 1.0.

Figure 9 An example showing why distance is not enough for classification: d(M1, U) = (298 - 293) + (393 - 387) = 5 + 6 = 11; d(M2, U) = (293 - 290) + (387 - 383) = 3 + 4 = 7. So d(M2, U) is smaller than d(M1, U), but, as can be seen from the figure, the unknown document should be classified as M1 rather than M2

In a case where there are various sizes of documents, adding size to the feature vector for differentiating documents may improve the performance of the classification module. A size measurement S_i is thus introduced to adjust the similarity measurement. S_i is computed by:

S_i = SZunknown / SZ_i,  if SZ_i > SZunknown;
S_i = SZ_i / SZunknown,  if SZ_i ≤ SZunknown,   (4)

where SZunknown and SZ_i are the sizes computed by the preprocessing module for the unknown document and the model class M_i, respectively.

S_i will be close to one when the document and the model M_i are of the same size; for example, both of them may be A4 pages, or both may be personal checks. But if a personal check is compared with an A4-sized document, this measurement will drop to about 0.27 (3 inches divided by 11 inches). This should be weighted heavily in the final similarity measurement.

These two values are then multiplied by the value computed from Equation (2) to get the final similarity measurement with model class M_i, i.e.,

L_i = ((A - D_i) / A) × (Nmatched_i / Nm_i) × (Nmatched_i / Nun) × S_i.   (5)

The final step in this module is to classify documents based on the L_i values. The unknown document should be classified to be of the same form as the model class with the largest similarity measurement, provided that the measurement is greater than a predetermined threshold, say th1. If the largest similarity measurement is less than another threshold, say th2 (th1 > th2), the unknown …
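Equations (2) to (5) combine into the final similarity measurement as in this sketch; the argument names are illustrative, not the paper's notation.

```python
def similarity(A, D_i, n_matched, n_model, n_unknown, sz_model, sz_unknown):
    """Combine the distance score (Equation 2), the correlation factor
    (Equation 3) and the size measurement (Equation 4) into the final
    similarity L_i of Equation (5)."""
    distance_score = (A - D_i) / A                 # Equation (2)
    r1 = min(n_matched / n_model, 1.0)             # ratios capped at 1.0
    r2 = min(n_matched / n_unknown, 1.0)
    correlation = r1 * r2                          # Equation (3)
    if sz_model > sz_unknown:                      # Equation (4)
        size = sz_unknown / sz_model
    else:
        size = sz_model / sz_unknown
    return distance_score * correlation * size     # Equation (5)
```

For instance, a model that matches all of its 10 lines but covers only half of a 20-line unknown document of the same size is penalized through the correlation factor, not through the distance alone.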
[Figure: a sample classification run, listing the model classes and per-class match counts; the input document is classified as a member of F1040-2.1n2]
The time needed for classifying an unknown document is about 100 seconds with a general-purpose computer. If special-purpose hardware is available, the time is expected to be much less. The time distribution needed for each module in the existing system is shown in Figure 14. As is shown in the figure, storage and retrieval of digitized images is the most time-consuming part of the entire process. It takes about 20 seconds to store the image, and hence about the same amount of time is needed for retrieving it. About 40 seconds can be saved if images are taken from the expansion memory instead of the hard disk. At the present stage, the programs are written as separate modules and data are passed through the hard disk. These need extra program loading time, initialization time, and data reading and writing time. All of these can be shortened to some extent by putting the programs together and passing data via memory only. Writing some time-critical routines in assembly language is another way to cut the processing time. Through these modifications, the whole process can be accomplished within approximately 20 seconds. If hardware can be developed, real-time classification is also possible.

Figure 14 Time distribution of the entire process for processing a document (scanning, filtering, feature extraction, classification). Total time ≈ 112 sec

REFERENCES

1 Wolber, O. A syntactic omni-font character recognition system. Proc. IEEE 9th Int. Conf. on Pattern Recognition, pp. 168-173 (1986)
2 Wong, K. Y., Casey, R. G. and Wahl, F. M. Document analysis system. IBM J. Res. Develop., 26 (6), 647-655 (November 1982)
3 Kubota, K., Iwaki, O. and Arakawa, H. Document understanding system. Proc. IEEE 9th Int. Conf. on Pattern Recognition, pp. 612-614 (1986)
4 Wahl, F. M., Wong, K. Y. and Casey, R. G. Block segmentation and text extraction in mixed text/image documents. Computer Graphics and Image Processing, 20, 375-390 (1982)
5 Meynieux, E., Seisen, S. and Tombre, K. Bilevel information recognition and coding in office paper documents. Proc. IEEE 9th Int. Conf. on Pattern Recognition, pp. 442-445 (1986)
6 Kida, H., Iwaki, O. and Kawada, K. Document recognition system for office automation. Proc. IEEE 9th Int. Conf. on Pattern Recognition, pp. 446-448 (1986)
7 Hidai, Y., Ooi, K. and Nakamura, Y. Stroke reordering algorithm for on-line hand-written character recognition. Proc. IEEE 9th Int. Conf. on Pattern Recognition, pp. 934-936 (1986)
8 Kondo, S. and Attachoo, B. Model of handwriting process and its analysis. Proc. IEEE 9th Int. Conf. on Pattern Recognition, pp. 562-565 (1986)
9 Huang, J. S. and Chuang, K. Heuristic approach to handwritten numeral recognition. Pattern Recognition, 19 (1), 15-19 (1986)
10 Srihari, S. N., Wang, C.-H., Palumbo, P. W. and Hull, J. J. Recognizing address blocks on mail pieces: specialized tools and problem-solving architecture. AI Magazine, Winter, 25-40 (1987)
11 Casey, R. G. and Jih, C. R. An on-line mini-computer based system for reading printed text aloud. IEEE Trans. Systems, Man and Cybernetics, SMC-4 (January 1978)
12 Schurman, V. A. Image Pattern Recognition. Springer-Verlag, New York (1974)
13 Tsuji, Y. and Asai, K. Character image segmentation. SPIE Vol. 504, Applications of Digital Image Processing VII, pp. 2-9 (1984)
14 Babaguchi, N., Tsukamoto, M. and Aibara, T. Knowledge aided character segmentation from handwritten document image. Proc. IEEE 9th Int. Conf. on Pattern Recognition, pp. 573-575 (1986)
15 Tampi, K. R. and Chetlur, S. S. Segmentation of handwritten characters. Proc. IEEE 9th Int. Conf. on Pattern Recognition, pp. 684-686 (1986)
16 Inagaki, K., Kato, T., Hiroshima, T. and Sakai, T. MACSYM: a hierarchical parallel image processing system for event-driven pattern understanding of documents. Pattern Recognition, 17 (1), 85-108 (1984)
17 Casey, R. G. and Nagy, G. Recursive segmentation and classification of composite character patterns. Proc. 6th Int. Joint Conf. on Pattern Recognition, pp. 1023-1026 (1982)
18 Higashino, J., Fujisawa, H., Nakano, Y. and Ejiri, M. A knowledge-based segmentation method for document understanding. Proc. IEEE 9th Int. Conf. on Pattern Recognition, pp. 745-748 (1986)
19 Schurman, J. Reading machines. Proc. IEEE 6th Int. Conf. on Pattern Recognition, pp. 1031-1044 (1982)