
An automated system for document recognition*

Wei-Chung Lin and Yu-Jen Eugene Feng
Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, U.S.A.
(Received March 1989)

ABSTRACT
In this paper, an automated document recognition system is presented. The distinctive feature of this system is that it recognizes a document based on the invariant horizontal line segments instead of extracting the key words in the document. In other words, it avoids the more complicated character-recognition or text-understanding techniques. The system consists of four major components: digitizer, preprocessor, feature-extractor, and line-pattern classifier. The digitizer converts input documents to digital bit-map images. The preprocessor scales the digitized image to a suitable level of resolution and eliminates artifacts produced by the digitizer. The feature-extractor locates line segments in the scaled image and adjusts them to minimize skew and shift effects. The line-pattern classifier takes the adjusted line segments as input and checks the model database to decide whether the unknown document can be classified as one of the prestored model classes. Experimental results are given to demonstrate the performance of the system.

INTRODUCTION
With the advent of pattern recognition and digital image analysis technology and the decreasing cost of computer hardware, it becomes feasible and economical to design special-purpose machine vision systems for office automation. Some typical and important tasks in office environments, such as identification and classification of document types, extraction of information of interest from documents, inspection of document image quality, confirmation of page order, and recognition and understanding of text, are good candidates for automation [1-6]. Automated systems can remove human operators from tedious jobs which might potentially impair their productivity because of boredom or visual fatigue.

In this paper, an automated system which is capable of recognizing and classifying documents with horizontal lines is proposed. These kinds of documents are used in various situations where clients are asked to fill out forms. The rest of the paper is organized as follows: In the next section, a survey of the state of the art in applying pattern recognition and image analysis techniques to office automation is presented. This is followed by a brief overview of the proposed system. Then the preprocessor, the feature-extractor, and the classifier subsystems are detailed, and finally, some experimental results are reported.

A SURVEY OF COMPUTER VISION TECHNIQUES FOR AUTOMATED DOCUMENT ANALYSIS
The state-of-the-art technologies of computer vision and pattern recognition have been applied to character and document recognition. In this section, a survey of these techniques is given. Several commercial systems for document understanding are also presented.

Character recognition
Character-recognition techniques have been explored extensively during the past few years. These techniques can be categorized into two classes: those for recognizing machine-printed characters and those for recognizing handwritten characters. By using an OCR (Optical Character Reader), the characters generated by machines can easily be recognized, once they have been extracted from the original documents. The older-generation OCRs require a special-type font-optimizer for machine recognition. Several newly-built OCRs can recognize elite, courier, and other popular typewriter fonts, although the typist who prepares the document must often follow special document formatting procedures. Well-organized documents printed by machines can be recognized cost-effectively with OCR devices. The characters can be encoded and transmitted to another device (such as a computer or a laser printer) so that they can be stored, processed for future use, or printed.

Hand-printed characters are more difficult to recognize because they do not have the regularities of machine-printed ones. In Reference 7, a model to simulate human writing is proposed. In this model, hand-printed character strings can be considered to be composed of connected strokes.

* Research supported in part by the Bell & Howell Company.

0952-1976/89/020120-11 $2.00
© 1989 Pineridge Ltd

120 Eng. Appli. of AI, 1989, Vol. 2, June

These strokes consist of connected line segments. These line segments are then decomposed into X, Y components and distorted by a random number generator. Hand-printed characters can also be recognized by strokes [8]. Another approach uses direction changes in curves to recognize numerals [9]. These changes are structured and recorded. The recorded data are then used to classify numerals.

Usually both machine-printed and hand-printed characters are mixed in documents. Before recognizing characters, machine-printed and hand-printed regions on a document have to be separated, so that a specific tool can be applied to recognize characters in a particular region. An algorithm which can perform such tasks is proposed in Reference 10. Documents are first segmented into regions which can be either machine-printed or hand-printed. After the regions in the document have been segmented, machine-printed and hand-printed regions are distinguished by the regularities within each one of them.

After regions have been extracted and distinguished as machine-printed or hand-printed, characters in each region have to be separated so that character-recognition techniques can be applied to each of these regions. The character-segmentation problem for machine-printed characters can be formulated and classified as a pitch-estimation problem and a character-sectioning decision problem. Several pitch-based character-segmentation methods have been developed in order to process incorrect characters which cannot be recognized by OCRs. These methods are basically applied to constant-pitch characters. Most of them are based on standard character widths and topological features detected from the character geometries. One by one, the beginning and end positions of characters are determined by these different bits of information [11, 12]. In OCR-based systems, it is necessary to identify variable- and constant-pitch characters, and segmentation of these characters is an important factor for successful recognition. In Reference 13, a least-square-error function is introduced to estimate variable-pitch text. The problem of segmentation is more difficult for hand-printed characters. A knowledge-based system is proposed in Reference 14 to tackle this problem. General knowledge about characters and text lines, and specific knowledge about the documents the system deals with, constitute the knowledge base. The approach proposed in Reference 15 uses tendency and tense as features to segment characters. The tendency consists of vertical, right slant, horizontal, and left slant. The tense is used to analyse the relationships between the past and the present continuation to determine if the present contour is a boundary of characters. These two features are used to segment hand-printed characters in documents.

Document-understanding
Document-understanding is a special technique which makes machines understand the arrangements and the contents of documents. It involves all the techniques mentioned above. After documents are understood (i.e., well analysed) by machines, they can be stored or transmitted in a more efficient format than raw images. A system is proposed in Reference 2 for understanding documents. The system segments a document into smaller regions with only one type of data (such as text, graphics, or halftone images) in each region. Regions with different characteristics are processed in different ways. Relationships between regions are also recorded. After documents are analysed, they are stored in encoded form. MACSYM [16] is a knowledge-based, event-driven, parallel system for document-understanding developed in Japan. EXPRESS [15] is a system developed under the philosophy of MACSYM. After documents are understood, further processing, such as confirmation of the page order or identification of the document type, can be performed.

OVERVIEW OF THE SYSTEM
The proposed system consists of four major components: digitizer, preprocessor, feature-extractor, and line-pattern classifier. The digitizer converts input documents to digital bit-map images. The preprocessor scales the digitized image to a suitable level of resolution and eliminates artifacts produced by the digitizer. The feature-extractor locates line segments in the scaled image and adjusts them to minimize skew and shift effects. The line-pattern classifier takes the adjusted line segments as input and checks the model database to decide if the unknown document can be classified as one of the prestored model classes. The block diagram of the proposed system is shown in Figure 1. In the phase of model-base construction, the training samples, which are the prototypes (usually blank forms) of different documents, are fed through the scanner, the preprocessor, and the feature-extractor step by step.
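As a concrete illustration, the four-stage pipeline just outlined can be sketched in Python as follows. All names, the bit-map representation (lists of rows of 0/1 pixels), and the toy matching criterion are illustrative assumptions of ours, not details from the paper.

```python
# Illustrative skeleton of the four-stage pipeline described above.
# All names and the toy matching criterion are ours, not the paper's.
# A bit-map image is modeled as a list of rows of 0/1 pixels.

def digitize(document):
    """Stand-in for the 200 DPI scanner: here the 'document' is already a bit map."""
    return document

def preprocess(image):
    """Placeholder for scaling and artifact removal (detailed later in the paper)."""
    return image

def extract_lines(image):
    """Return horizontal runs of black pixels as (start_x, y, end_x, y) tuples."""
    lines = []
    for y, row in enumerate(image):
        x = 0
        while x < len(row):
            if row[x] == 1:
                start = x
                while x < len(row) and row[x] == 1:
                    x += 1
                lines.append((start, y, x - 1, y))
            else:
                x += 1
    return lines

def classify(lines, model_db):
    """Toy classifier: pick the model class sharing the most exact line segments."""
    return max(model_db, key=lambda name: len(set(lines) & set(model_db[name])))

def recognize(document, model_db):
    return classify(extract_lines(preprocess(digitize(document))), model_db)
```

In the model-base construction phase, `model_db` would be built by running blank prototype forms through the same first three stages.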

The final result for each different document, in the form of a line segment set, is stored in a database. In the classification phase, an unknown document is first processed by the three modules identical to those in the training phase. The resultant line segment set is then fed into the line-pattern classifier to have its class membership determined.

Figure 1 Block diagram of the proposed system

THE SCANNER
The scanner converts documents to digital images. In the proposed system, a scanner developed by the Bell & Howell Company is used for this purpose. The resolution of the scanner is 200 DPI (dots per inch). For A4-sized documents, the resolution of digitized images is 1664 x 1892. (Actually, the size of A4 paper is 8.5 x 11.0 inches, so the resolution should be 1700 x 2200. But the scanner reduces the size of the document to about 97 percent horizontally, and not the whole document is digitized vertically. This can be seen from Figure 2.) The digitized image is then transmitted to the expansion memory of an IBM-PC/AT via its DMA channel for processing. Eight pixels are grouped into a byte for fast transmission and efficient storage. This image is then taken from the memory and stored on the hard disk. A typical digitized image is shown in Figure 2.

Figure 2 A typical scanned image produced by the scanner

THE PREPROCESSING MODULE
The preprocessing module scales the digitized image to a suitable level of resolution and eliminates artifacts in it to improve the performance of subsequent processes. The first step is to scale down the digitized image to 236 x 208, using 8 x 8 windows. As can be seen from Figure 2, the scanner always creates a horizontal line at the top of the image and a black area at the bottom of the image. In a case where a small-size document is scanned, since the scanner fills the blank area with black pixels, the black area at the bottom of the image is large. Moreover, if the document is not wide enough, a black area will appear at the left or right margin of the image. This situation occurs in both the digitized and the scaled images. The second step of the preprocessing module is then to eliminate these areas. These procedures are detailed in the following subsections.

Scaling
The digitizer is a 200 DPI scanner. If the document is printed by a 300 DPI or 400 DPI printer, 1-pixel-wide horizontal lines will not be digitized perfectly, i.e., they are likely to be broken. If they are broken, the preprocessing module tries to reconnect them. As is shown in Figure 3, these lines are broken into several line segments isolated by white pixels. These white pixels are isolated horizontally, even though they are connected vertically. The way to bridge these broken segments is to find these horizontally isolated white pixels and replace them with black ones. It is assumed that these are artifacts produced by the scanner, because it is impractical to print such things on a document to make part of the line segments invisible. To speed up the subsequent processing, nonoverlapping 8 x 8 windows are used to 'shrink' each 8 x 8 pixel set to one pixel. In order to preserve horizontal lines and in the meantime to eliminate characters, the scaling process is done as follows: in each window, the number of black pixels in each row is counted, and the maximum of these


counts is recorded. It is then compared with a threshold. If it is larger than the threshold, the corresponding pixel in the scaled image is set to black; otherwise, it is set to white. The threshold is based on the following assumption: if there is a horizontal line passing through the window, the maximum count is at least 7, as is depicted in Figure 4. A typical scaled image is shown in Figure 5. As is shown in the figure, horizontal lines are preserved while characters are all scaled to isolated small black areas.

Figure 3 An example showing some broken horizontal lines

Figure 4 An example showing how a horizontal line passes through a window

Redundant area removal
As discussed previously, it is possible to have redundant areas in an original digitized image or a scaled image. Because there is always a horizontal line at the top of the image, it can be used as a base line for tracing redundant areas. Firstly, the module traces each column of the image from top to bottom and eliminates any black pixel connected to the first black one found. Of course, the first black pixel is always on the first row. Secondly, it traces each column of the image from bottom to top and eliminates any black pixel connected to the first one found. The coordinates of the last eliminated pixel in each column are recorded. The maximum Y coordinate among these recorded pixels indicates the size of the scanned document and is recorded for later use. This situation is shown in Figure 6.

Figure 5 A typical scaled image

FEATURE EXTRACTOR
The feature extractor locates and restores horizontal lines in the scaled image. It consists of two parts: the line-tracing module and the line-adjustment module.
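The scaling procedure described in the preprocessing section (bridge horizontally isolated white pixels, then shrink each 8 x 8 window to one pixel using the maximum per-row black count) can be sketched as follows. This is a minimal reading of the text; the threshold of 7 follows the stated assumption that a horizontal line yields a maximum row count of at least 7 in a window.

```python
def bridge_broken_lines(image):
    """Turn white pixels that are horizontally isolated between two black
    pixels into black, reconnecting horizontal lines broken by the scanner."""
    out = [row[:] for row in image]
    for row in out:
        for x in range(1, len(row) - 1):
            if row[x] == 0 and row[x - 1] == 1 and row[x + 1] == 1:
                row[x] = 1
    return out

def shrink(image, win=8, threshold=7):
    """Shrink each win x win window to one pixel: black if some row inside
    the window contains at least `threshold` black pixels, else white."""
    h, w = len(image), len(image[0])
    scaled = []
    for wy in range(0, h, win):
        out_row = []
        for wx in range(0, w, win):
            max_count = max(
                sum(image[y][wx:wx + win]) for y in range(wy, min(wy + win, h))
            )
            out_row.append(1 if max_count >= threshold else 0)
        scaled.append(out_row)
    return scaled
```

A full horizontal line survives the shrink, while sparse character pixels fall below the per-row threshold and disappear.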

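The redundant-area removal can be sketched in the same spirit. The bottom-up pass and the returned document height follow our reading of the description (the maximum Y coordinate of the last pixel erased in each column); treat this as an interpretation, not the authors' exact code.

```python
def remove_redundant_areas(image):
    """Erase the scanner-produced top line and bottom black area, column by
    column, and return (cleaned_image, document_height)."""
    out = [row[:] for row in image]
    h, w = len(out), len(out[0])
    doc_height = 0
    for x in range(w):
        # Top-down pass: erase the run of black pixels attached to the top row
        # (the scanner always produces a black line there).
        y = 0
        while y < h and out[y][x] == 1:
            out[y][x] = 0
            y += 1
        # Bottom-up pass: find the bottom-most black pixel and erase the run
        # connected to it, remembering the last (highest) pixel erased.
        y = h - 1
        while y >= 0 and out[y][x] == 0:
            y -= 1
        last_erased = None
        while y >= 0 and out[y][x] == 1:
            out[y][x] = 0
            last_erased = y
            y -= 1
        if last_erased is not None:
            doc_height = max(doc_height, last_erased)
    return out, doc_height
```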

Figure 6 Document size determination

Since documents are seldom fed into the scanner in an upright manner, horizontal lines are likely to be broken into several connected segments with different Y coordinates (as seen in Figures 2 and 5). The line-tracing module traces the scaled image and finds all straight lines. It starts from the upper-left corner of the scaled image and searches each column from top to bottom looking for black pixels. If it finds one, it searches toward the right of the image to find another black pixel. There are three possible trace directions (right, upper right, and lower right), as shown in Figure 7. The tracing module always traces to the right if possible. If it fails to do so, it tries the other two directions. At each movement, a counter which counts the number of traced pixels is incremented by one. It stops when no further movement can be made. The value of the counter is then checked to see if it is larger than a threshold. This threshold should be determined by (1) the average minimum line length in all the documents, (2) the resolution of the digitizer, and (3) the resolution chosen for processing. If the value is larger than the threshold, the start and end coordinates are recorded. If the traced line is long enough, pixels on it are not allowed to be traced again. The tracing module terminates after all black pixels have been traced and all straight lines extracted.

Line adjustment
After all straight lines have been extracted, the next task is to compensate for the rotational and translational distortions introduced during scanning. The first step is to find the 'real' lines. Because the line-tracing module has three allowed directions to trace, it is possible that more than one line will be found even though there is only one line in the scaled image. The reasons for this are as follows:

1. Thicker lines are likely to occupy more than one row in the scaled image.
2. Some characters are connected to line segments.
3. The document is skewed when it is scanned.

These problems are solved by the method described in the following subsections.

Merging lines
This process checks the spatial relationships of any two lines found in the scaled image, and merges them if certain conditions are satisfied. There are three possible operations on a pair of lines:

1. No operation on either line.
2. Discard the shorter line which is connected with the longer one.
3. Merge the two lines to form a longer one.

These operations are executed according to three conditions. Let L^i and L^j be the two lines being checked, where i != j, i, j <= n, and n is the number of lines in the scaled image. Let Lsx, Lsy, Lex, and Ley be the start- and end-point coordinates of a line, respectively. The three conditions are as follows:

1. If |L^i_sy - L^j_sy| > 2 and |L^i_ey - L^j_ey| > 2, no action is taken.
2. If l_i <= l_j, L^i_sx >= L^j_sx, and L^i_ex <= L^j_ex, where l_i and l_j are the lengths of lines i and j, respectively, then L^i is merged with L^j, i.e., L^i is removed from the line segment set.
3. If L^i_sx < L^j_sx and L^i_ex > L^j_sx, then let L^i_ex = L^j_ex and L^i_ey = L^j_ey, and remove L^j from the line segment set.

Note that conditions 2 and 3 are checked only if condition 1 is not satisfied. These conditions and operations are illustrated in Figure 8.

Horizontal line restoration and position normalization
After the merging process, only one-pixel-wide lines are left in the image.
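The merging rules can be sketched as follows, with lines represented as (sx, sy, ex, ey) tuples, sx <= ex, and length ex - sx. Condition 3 is partially reconstructed from a garbled passage in the original, so the overlap test here is an interpretation rather than the authors' exact condition.

```python
# Sketch of the three merging conditions. Lines are (sx, sy, ex, ey) tuples.
# The overlap rule (condition 3) is our reconstruction of a garbled passage.

def merge_lines(lines):
    lines = list(lines)
    changed = True
    while changed:
        changed = False
        for i in range(len(lines)):
            for j in range(len(lines)):
                if i == j:
                    continue
                li, lj = lines[i], lines[j]
                # Condition 1: endpoints far apart vertically -> no action.
                if abs(li[1] - lj[1]) > 2 and abs(li[3] - lj[3]) > 2:
                    continue
                len_i, len_j = li[2] - li[0], lj[2] - lj[0]
                # Condition 2: shorter line i lies within line j -> discard i.
                if len_i <= len_j and li[0] >= lj[0] and li[2] <= lj[2]:
                    del lines[i]
                    changed = True
                    break
                # Condition 3 (reconstructed): partial overlap -> extend i, drop j.
                if li[0] < lj[0] and li[2] > lj[0] and li[2] < lj[2]:
                    lines[i] = (li[0], li[1], lj[2], lj[3])
                    del lines[j]
                    changed = True
                    break
            if changed:
                break
    return lines
```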


Figure 7 Tracing near-horizontal lines in a scaled image

These lines are usually skewed (see Figures 2 and 5). The final step of feature extraction is to rotate the entire image according to the skew angle estimated from these lines, and then to normalize its position by shifting the minimum bounding rectangle of these lines to the upper-left corner.

Skew angle estimation for horizontal line restoration. The skew angle estimation is based on the lines in the scaled image. These are assumed to be horizontal lines in the original document. The slopes of all the lines are computed first. They will be zero if the document is not skewed, but if the document is skewed, the slopes of longer lines will not be zero. The skew angle of the document is estimated as follows:

M_i = (L^i_ey - L^i_sy) / (L^i_ex - L^i_sx)

where M_i is the slope of the ith line, i <= n, n is the number of lines in the scaled image, and L^i_sx, L^i_sy, L^i_ex, and L^i_ey are the x and y coordinates of the two ends of the line, respectively. A histogram of the slopes is constructed for estimating the skew angle. The range of angles is partitioned into a finite number of intervals. The interval that has the largest count is considered to be the range in which the skew angle is most likely to lie.
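The slope-histogram estimate can be sketched as follows: compute the slope of every line, histogram the slopes, and convert the average slope of the most populated interval to an angle. The number of bins and the slope range are illustrative choices of ours; the paper does not specify how the range is partitioned.

```python
import math

def estimate_skew_angle(lines, num_bins=32, max_slope=0.5):
    """Estimate document skew from line slopes: histogram the slopes,
    take the most populated bin, and return atan(average slope in it).
    Lines are (sx, sy, ex, ey) tuples; bin parameters are illustrative."""
    slopes = []
    for sx, sy, ex, ey in lines:
        if ex != sx:
            slopes.append((ey - sy) / (ex - sx))
    if not slopes:
        return 0.0
    width = 2 * max_slope / num_bins
    bins = {}
    for m in slopes:
        bins.setdefault(int((m + max_slope) // width), []).append(m)
    best = max(bins.values(), key=len)   # most populated interval
    m_av = sum(best) / len(best)         # average slope in that interval
    return math.atan(m_av)
```

With three lines of slope 0.1 and one outlier of slope 0.4, the histogram vote picks the 0.1 interval and the outlier is ignored.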


The average slope in that interval is used to estimate the skew angle. The angle is then used to adjust the lines. The procedure used to compensate for the skew effect in the scaled image is described in the next paragraph.

Assume that D_u is the interval where the maximum count occurs. M_av (the average slope of all the lines within D_u) is computed as:

M_av = (sum of M_i) / N

where i is the index and N is the number of lines with slopes in the interval D_u. The estimated skew angle theta_av is then computed as tan^-1(M_av). After theta_av has been computed, all the lines can be adjusted by the following process. Assume that (X_old, Y_old) are the coordinates of either end of any line in the scaled image and (X_new, Y_new) are the new coordinates for the point. Then the new coordinates are computed as follows:

X_new = R x cos(theta_old - theta_av)
Y_new = R x sin(theta_old - theta_av)

where

R = sqrt(X_old^2 + Y_old^2) and theta_old = tan^-1(Y_old / X_old).

Under this transformation, lengths of lines as well as distances between lines are preserved.

Figure 8 Adjusting lines in a scaled image. (a) No operation on either line. (b) Discard Line 1. (c) Merge Line 1 and Line 2 to form Line 3

Position normalization. After all the lines have been restored to their horizontal positions, the final step is to enclose the lines with their minimum bounding rectangle and shift them to the upper-left corner of the image. This process is described as follows: Let L^i be a line to be restored, where i = 1, ..., N, and N is the number of lines in the scaled image. Let L^i_x and L^i_y be the coordinates of the left end point of L^i. Let Xmin = min(L^i_x), i = 1, ..., N, and Ymin = min(L^i_y), i = 1, ..., N. Then L^i_x = L^i_x - Xmin and L^i_y = L^i_y - Ymin. These normalized horizontal lines are used as features for classification.

THE LINE PATTERN CLASSIFIER
The goal of this module is to match the unknown input document with the prestored documents in the database. Although all the horizontal lines have been restored to their original positions, there are still some problems caused by the following facts:

1. Some short lines may appear after customers fill out the document;
2. The restoration process may cause some displacement (around 2 to 3 pixels in the Y direction);
3. The scanner may not produce an identical image during each scan of a particular document. Small shifts in both X and Y directions are likely to occur during each scan.

The proposed line pattern classifier is robust enough to deal with all the problems mentioned above. The first step in this module is to find the nearest line in each model class for every line in the unknown document. To find these lines in each model class, differences between each line in the unknown document and each line in every model class are first computed by Equation (1):

Difference(Lm, Lu) = Wx x |Lmx - Lux| + Wy x |Lmy - Luy| + Wl x |Lml - Lul|    (1)

where Lm is a line in a model class and Lu is a line in the unknown document; Lmx, Lmy, and Lml are the respective x, y, and l (length) components of Lm; Lux, Luy, and Lul are the respective x, y, and l components of Lu; and Wx, Wy, and Wl are weighting factors.

All the lines in a model class are checked to find the one with the minimum difference from a specific line in the unknown document. If the computed difference is small enough, a matched pair is established between the unknown document and the model class. The number of matched pairs in a model class is also recorded.
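The skew compensation and position normalization described above can be sketched as follows. math.atan2 is used instead of tan^-1(Y/X) so that points with X = 0 are handled; otherwise the code mirrors the formulas above.

```python
import math

def rotate_point(x, y, theta_av):
    """Rotate a point by -theta_av about the origin, matching
    X_new = R cos(theta_old - theta_av), Y_new = R sin(theta_old - theta_av)."""
    r = math.hypot(x, y)
    theta_old = math.atan2(y, x)
    return (r * math.cos(theta_old - theta_av),
            r * math.sin(theta_old - theta_av))

def normalize_positions(lines):
    """Shift the minimum bounding rectangle of the lines to the upper-left
    corner: subtract the minimum x and y over all left end points."""
    xmin = min(sx for sx, sy, ex, ey in lines)
    ymin = min(sy for sx, sy, ex, ey in lines)
    return [(sx - xmin, sy - ymin, ex - xmin, ey - ymin)
            for sx, sy, ex, ey in lines]
```

Because the rotation preserves each point's distance R from the origin, line lengths and inter-line distances are preserved, as the text notes.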


This process is performed for every model class. The threshold used to determine whether a matched pair is found depends on the resolution of the scaled image and the maximum tolerance that the scanner can bear. The differences of all the matched pairs in each model class are then averaged by the number of matched pairs in that model class. This averaged value is defined to be the distance between the unknown document and the model class. A measurement of how likely the unknown document is to belong to a model class is derived from these distances. It is computed by:

(A - D_i) / A    (2)

where A is the sum of all the distances, and D_i is the distance between the unknown document and the model class M_i.

The model class that has the smallest distance will have the largest value computed from Equation (2). In most cases, the 'right' model class will have the largest value, but this is not always the case. When a document with only a few lines is compared with a model with many lines (or vice versa), it is possible that all the lines in the unknown document (or the model, according to which one has fewer lines) will be matched, and the computed distance will be very small (see Figure 9). In such cases, we cannot classify the document successfully. Another measurement is then introduced to solve this problem: the number of matched pairs. A factor, referred to as the correlation factor, is derived from this number and computed by:

(N^i_matched / N^i_m) x (N^i_matched / N_un)    (3)

where N^i_matched is the number of matched pairs in model class M_i, N^i_m is the number of lines in model class M_i, and N_un is the number of lines in the unknown document. Note that neither N^i_matched / N^i_m nor N^i_matched / N_un is allowed to exceed 1.0. If this happens, the ratio is set to 1.0.

Figure 9 An example showing why distance is not enough for classification. d(M1, U) = (298 - 293) + (393 - 387) = 5 + 6 = 11; d(M2, U) = (293 - 290) + (387 - 383) = 3 + 4 = 7. So d(M2, U) is smaller than d(M1, U), but, as we can see from the figure, the unknown document should be classified as M1 rather than M2

In a case where there are various sizes of documents, adding size to the feature vector for differentiating documents may improve the performance of the classification module. A size measurement S_i is thus introduced to adjust the similarity measurement. S_i is computed by:

S_i = SZ_unknown / SZ_i,  if SZ_i > SZ_unknown
S_i = SZ_i / SZ_unknown,  if SZ_i <= SZ_unknown    (4)

where SZ_unknown and SZ_i are the sizes computed by the preprocessing module for the unknown document and the model class M_i, respectively.

S_i will be close to one when the document and the model M_i are of the same size. For example, both of them may be of size A4, or both personal checks. But if a personal check is compared with an A4-sized document, this measurement will drop to about 0.27 (3 inches divided by 11 inches). This should be weighted heavily in the final similarity measurement.

These two values are then multiplied by the value computed from Equation (2) to get the final similarity measurement with model class M_i, i.e.,

L_i = ((A - D_i) / A) x (N^i_matched / N^i_m) x (N^i_matched / N_un) x S_i    (5)

The final step in this module is to classify documents based on the L_i's. The unknown document should be classified to be of the same form as the model class with the largest similarity measurement, provided that the measurement is greater than a predetermined threshold, say th1.

Eng. Appli. of AI, 1989, Vol. 2, June 127


An automated system for document recognition: W.-C. Lin and Y.-J. E. Feng

document will not match with any model class in the database. Another situation is that the largest similarity measurement falls between th1 and th2. Under this circumstance, a relative measurement is used to classify the unknown document. This ratio is used to see if the largest and the second largest similarity measurements are nearly the same. If so, the unknown document cannot be classified as a member of any one of the model classes. Otherwise, it can still be classified as a member of the document class with the largest L. This measurement can be obtained by computing the ratio between the largest L and the second-largest L. If this ratio is greater than a threshold (th3), the unknown document is classified as a member of the class with the largest L. Otherwise the unknown document will not be considered to be matched with any model class in the database. This can be described quantitatively as follows.

The unknown document can be classified to model class M_m if and only if

    1. L_m >= th1,                                          (6)

or

    2. L_m is in [th2, th1] and L_m / L_s >= th3,           (7)

where

    L_m = Max({L_i, i = 1, ..., n, where n is the number of model classes}),

and

    L_s = Max({L_i} - {L_m}).

Otherwise, the unknown document cannot be classified as a member of any one of the document classes in the database. In our experiments, th1 = 50, th2 = 35 and th3 = 1.5 are used. These thresholds are determined empirically, although they can be determined by distances between model classes in the feature space.

Figure 10 The configuration of the proposed system (scanner, PC/AT, monitor and keyboard)

EXPERIMENTAL RESULTS

In the proposed system, a scanner developed by the Bell & Howell Company is used as the digitizer to digitize documents. The configuration of the proposed system is shown in Figure 10. A program written in C takes the digitized image from the hard disk, scales it to a 236×208 image, extracts horizontal lines in it, makes necessary adjustments, and performs the classification. Currently, the average computation time for the system to classify a document is about 100 seconds.

In the particular experiment to be described, twenty-five documents are used as models. They are transformed to line patterns and stored in the model database. Filled forms are used as unknown samples for testing. In this experiment, we achieved a zero error rate, and the similarity measurement in the cases of successful classification is usually over 50. In a case of rejection (i.e., the prototype of the unknown document is not prestored in the database), it is usually under 35. However, we cannot exclude the possibility that two documents (line patterns) may be very similar; in this case, even if they have different titles and printed information, it is still hard to identify them from line patterns. In the experiments conducted, these twenty-five documents were selected arbitrarily from tax return forms. There is no special preference for any kind of forms; the only restriction is that there must be at least one horizontal line in the document.

For convenience, only 5 forms are used for illustration (although there are still twenty-five documents in the model database, only the five model classes with the largest similarity measurements are reported). Document BELL1 is not stored in the database. As a result, the line pattern classifier cannot classify it as belonging to any class in the model base. As can be seen from the report of the program (Figure 13), the largest similarity measurement is only 24 (the maximum similarity measurement can be as large as 100), and the ratio between the largest and the second largest similarity measurement is only 1.19. Document USA2 is classified to F1040SA2 successfully with a similarity measurement of 53. Documents UNR1, UEZ and U10402 are also classified successfully, as is shown in Figure 13. Line patterns for these models and the unknown documents are shown in Figures




11 and 12, respectively. The classification results are shown in Figure 13.

Sample: bell1

    Model name      Match   Mismatch   Difference   Similarity
    F1040NR3.ln2      22       21        328.000        34
    F1040SA1.ln2      27       16        463.000        34
    F1040-2.ln2       27       16        413.000        32
    f1040nr2.ln2      25       18        388.000        25
    f1040a3.ln2       13       30        140.000        23
    No match found
    Time = 0:2

Sample: uez

    Model name      Match   Mismatch   Difference   Similarity
    F1040EZ1.ln2      15        2         13.000        66
    F1040NR5.ln2       2       15          5.000         7
    F1040NR3.ln2       2       15          1.000         0
    F1040-2.ln2        2       15          8.000         0
    F1040SA1.ln2       2       15         24.000         0
    The input document is classified as a member of F1040EZ1.ln2.
    Time = 0:1

Sample: usa2

    Model name      Match   Mismatch   Difference   Similarity
    F1040SA2.ln2      25        0         56.000        52
    F1040NR3.ln2      13       12        120.000        24
    IL1040-1.ln2      12       13        234.000        19
    F1040SA1.ln2      14       11        122.000        18
    IL505I2.ln2        6       19        132.000        11
    The input document is classified as a member of F1040SA2.ln2.
    Time = 0:1

Sample: u10402

    Model name      Match   Mismatch   Difference   Similarity
    F1040-2.ln2       28        1         75.000        63
    f1040nr2.ln2      21        8        214.000        28
    F1040NR1.ln2      15       14        262.000        11
    F1040SA2.ln2      12       17        110.000         9
    I1040ES2.ln2       6       23        136.000         5
    The input document is classified as a member of F1040-2.ln2.
    Time = 0:1

Sample: unr1

    Model name      Match   Mismatch   Difference   Similarity
    F1040NR1.ln2      48        7        366.000        64
    F1040NR3.ln2      17       38        942.000        17
    F1040SA2.ln2      22       33        350.000        16
    F1040-2.ln2       15       40        227.000         8
    f1040nr2.ln2      15       40        196.000         8
    The input document is classified as a member of F1040NR1.ln2.
    Time = 0:2

Figure 13 Reports of classification results of Figures 11 and 12

Figure 11 Five line patterns in the model database (models F1040A3, F1040NR1, F1040-2, F1040SA2 and F1040EZ1)

Figure 12 Five line patterns of unknown documents (BELL1, UEZ, USA2, U10402 and UNR1)

CONCLUSION

In this paper, a system using horizontal lines as features for document classification is proposed. The advantage of this system is that complicated processes such as character segmentation, character recognition and hand-printed/machine-printed text segmentation techniques are not necessary. The system is divided into four major components: the digitizer, the preprocessor, the feature extractor, and the line-pattern classifier. The digitizer digitizes input documents to digital images. The preprocessing module scales the digitized image to a suitable level of resolution for processing, and eliminates artifacts produced by the digitizer. The feature extractor locates near-horizontal lines in the scaled image, makes adjustments to them, and restores them to horizontal positions. These lines, enclosed by their minimum bounding rectangle, are then shifted to the upper-left corner of the image for position normalization. If the system is in the learning mode, these horizontal line features are stored in the model database. If the system is in the classification mode, the line pattern classifier takes the extracted horizontal line features as input and checks the model database to see if the document can be classified as a member of any existing model class. In all our experiments, the error rate of the system is zero.




The time needed for classifying an unknown document is about 100 seconds with a general-purpose computer. If special-purpose hardware is available, the time is expected to be much less. The time distribution needed for each module in the existing system is shown in Figure 14. As is shown in the figure, storage and retrieval of digitized images is the most time-consuming part of the entire process. It takes about 20 seconds to store the image, and hence about the same amount of time is needed for retrieving it. About 40 seconds can be saved if images are taken from the expansion memory instead of the hard disk. At the present stage, the programs are written in different modules and data are passed through the hard disk. These need extra program loading time, initialization time, and data reading and writing time. All of these can be shortened to some extent by putting the programs together and passing data via memory only. Writing some time-critical routines in assembly language is another way to cut the processing time. Through these modifications, the whole process can be accomplished within approximately 20 seconds. If hardware can be developed, real-time classification is also possible.

Figure 14 Time distribution of the entire process for processing a document (scanning, filtering, scaling, feature extraction and classification). Total time ≈ 112 sec

