A Robust Algorithm For Text String Separation From Mixed Text/Graphics Images
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 10, NO. 6, NOVEMBER 1988
Abstract-An automated system for document analysis is extremely desirable. A digitized image consisting of a mixture of text and graphics should be segmented in order to represent more efficiently both the areas of text and graphics. This paper describes the development and implementation of a new algorithm for automated text string separation which is relatively independent of changes in text font style and size, and of string orientation. The algorithm does not explicitly recognize individual characters. The principal components of the algorithm are the generation of connected components and the application of the Hough transform in order to group together components into logical character strings which may then be separated from the graphics. The algorithm outputs two images, one containing text strings, and the other graphics. These images may then be processed by suitable character recognition and graphics recognition systems. The performance of the algorithm, both in terms of its effectiveness and computational efficiency, was evaluated using several test images. The results of the evaluations are described. The superior performance of this algorithm compared to other techniques is clear from the evaluations.

Index Terms-Connected component analysis, document analysis, image processing, image segmentation, image understanding, text and graphics separation, text recognition.

Manuscript received August 8, 1986; revised June 12, 1987. Recommended for acceptance by J. Kittler. This work was supported in part by the National Science Foundation under Grant ECS-8307445.
L. A. Fletcher was with the Department of Electrical Engineering, the Pennsylvania State University, University Park, PA 16802. He is now with Bell Communications Research, 331 Newman Springs Road, Red Bank, NJ 07701.
R. Kasturi is with the Department of Electrical Engineering, the Pennsylvania State University, University Park, PA 16802.
IEEE Log Number 8823861.

INTRODUCTION

The move from paper-based documentation towards computerized storage and retrieval systems has been prompted by the many advantages to be gained from the electronic document environment. Document update and revision is efficiently achieved in the computerized form. For efficient processing and storage of documents, however, it is necessary to generate a description of graphical elements in the document rather than a bit-map in order to decrease the storage and processing time. Thus, increasing emphasis is being placed on the need for the realization of computer-based systems which are capable of providing automated analysis and interpretation of paper-based documents [1]-[3]. Much of the attention paid to automated document analysis systems in the literature has been in relation to engineering drawings and diagrams [4], [5]. Such systems provide a means for originating technical information (text and graphics) in a digital form suitable for interactive graphics editing, reproduction, and distribution.

The initial processing stage in an automated document analysis system requires conversion of paper-based graphics/text to a digital bit-map representation. Wherever the primary goal of the automated document analysis system is interpretation of graphic data, text strings present within the digitized document must first be separated from the graphics in order that subsequent processing stages may operate exclusively on the graphic information. The extracted text may be stored separately for input to a character recognition system for later retrieval or revision. Since document types vary widely in style and content of both graphic and text data, an algorithm to perform text string removal must be able to accommodate documents containing text of various font styles and sizes. Further, the documents may contain text strings which are intermingled with graphics, and text characters which are similar in size or shape to graphics. In general, text strings may be of any orientation in the image; not simply horizontal or vertical but possibly diagonally aligned.

Several algorithms for text string separation have been reported in the literature [6]-[9]. However, many of these algorithms are very restrictive in the type of documents they can process and are therefore not useful in a general automated document analysis system. For example, the combined symbol matching algorithm [7] is sensitive to changes in text font style and size: the algorithm must be run once for each font type using different parameters, words of less than three characters which are embedded in longer strings are not removed, and the string removal is only along a specified orientation. The Block Segmentation technique [8] broadly classifies regions into text or graphics; i.e., characters which lie within predominantly graphics regions are classified as graphics. The Bley algorithm [9] is also sensitive to variations in text font style and size; the algorithm breaks connected components into subcomponents, which makes it difficult to process the components for graphics recognition. Thus there is no single algorithm which is robust enough to segment images containing mixed graphics and text, with multiple font styles and sizes and strings of arbitrary orientation. This paper describes a new robust algorithm for text string separation from mixed text/graphics images.

A ROBUST ALGORITHM FOR TEXT STRING SEPARATION

A robust algorithm for text separation has been designed to separate text strings from graphics, regardless of string orientation and font size or style. The algorithm
[Fig. 2. Rectangles enclosing the connected components of test image 1.]

characters. In general, a mixed text/graphics image will produce connected components of widely varying areas. The larger connected components represent the larger graphic components of the image. It is desirable to locate and discard large graphics in order to restrict processing to components which are candidates for members of text strings, thereby improving both accuracy and processing speed.

By obtaining a histogram of the relative frequency of occurrence of components as a function of their area, it is possible to set an area threshold which broadly separates the larger graphics from the text components. By correct threshold selection, the largest of the graphics can be discarded, leaving only the smaller graphics and text components as members of the working set of connected components. The area threshold selection procedure must ensure that the threshold lies outside that part of the histogram which broadly represents the set of text characters. The threshold itself is determined using the area histogram (as opposed to presetting it to a value) although a priori knowledge about the types of documents being processed (such as the amount of text data compared to graphics), if available, is helpful in determining this threshold. For documents that contain a substantial amount of text and some large graphics, the most populated area will, in general, represent mostly text components or, at least, small graphics. By ensuring that the threshold is set above the most populated area, $A_{mp}$, the possibility of discarding members of the text character set is avoided. This alone is not adequate to process images that contain text strings of different sizes (it is likely that the most populated area corresponds to the smallest sized characters). Thus, a second parameter, the average area $A_{avg}$, is computed. The area threshold is then set at five times the larger of the two parameters $A_{mp}$ and $A_{avg}$. The histogram is searched to locate components that are larger

lie along any given straight line. By grouping components into strings associated with a particular line, the components can be ordered according to their distance along the line. Further grouping may then take place by examining the distance between characters. By comparing the intercharacter distance with the interword gap and intercharacter gap thresholds, the string can be segmented into logical character groups, that is, into logical words and phrases. Only if components belong to a logical character group can they be regarded as members of a valid text character string. Such components are identified and are separated from the image. These steps are explained in detail in this and the following sections.

The Hough transform is a line to point transformation [13] which, when applied to the centroids of connected components in an image, can be used to detect sets of connected components that lie along a given straight line. A line in the Cartesian space $(x, y)$ given by the equation

$$\rho = x \cos\theta + y \sin\theta \qquad (1)$$

is represented by a point $(\rho, \theta)$ in the Hough domain. Similarly, every point $(x, y)$ in the Cartesian space maps into a curve in the Hough domain. Thus to locate all the connected components that are collinear, the Hough transform is applied to the centroids of the rectangles enclosing each connected component. All the curves corresponding to the collinear components intersect at the same point $(\rho, \theta)$, where $\rho$ and $\theta$ specify the parameters of the line.

In the discrete case, the Hough domain is a two-dimensional array representing discrete values of $\rho$ and $\theta$. The resolution along the $\theta$ direction is set to one degree. The resolution along the $\rho$ direction is treated as a variable $R$. Optimal selection of $R$ is critical to correct grouping of connected components. This is because in any text string, the centroids of all characters belonging to a phrase are not necessarily collinear. For any phrase, character heights vary (upper, lower case) and character positions vary in relation to each other (ascenders, descenders). In general, the centroids of characters belonging to the same
phrase will lie in the range $\pm\delta$ from the axis of the line connecting the phrase members. However, $\delta$ is a function of the text string height. Thus, it is important to select a Hough domain resolution, before applying the transform, which will cause the majority of component centroids belonging to a phrase to be grouped into the same cell. The optimal resolution is dependent on the heights and relative positions of characters within a string. Too large a value for $R$ may cause several parallel strings to be grouped into a single cell (over-grouping). If $R$ is too small then connected components belonging to the same string may be grouped into different cells (undergrouping). The value of $R$ should depend on the average local character height, $H_a$, for a line $l$. However, since $H_a$ cannot be determined until characters have actually been detected as belonging to a collinear string, and since $R$ must be known before such grouping, an initial estimate for $R$ is made based on the average height, $H_{ws}$, of all connected components in the current working set. The resolution is then set to: $R = 0.2 \times H_{ws}$.

The resolution $R$ allows for some perturbations in the collinearity of components belonging to text strings. However, this will not be adequate to group all ascenders and descenders in a character string. Further, the value of $R$, which is based on average character height, will not be optimal for grouping text strings of different heights. Thus, clustering in the Hough domain may become necessary for each primary cell. Here, "primary cell" refers to the cell containing the majority of the connected components in a string. In order to group together all components belonging to a string, neighboring cells must be clustered into the primary string. The required degree of clustering, $f_{clus}$, will depend on the local character height. However, when strings are initially extracted, 11 cells centered around the primary cell are grouped for examination. Once the string has been extracted, the local clustering factor can be determined.

The average local character height for a string, $H_a$, is used to determine the degree of clustering around the primary cell. This will ensure that all ascenders and descenders in a string are included in the text string grouping. The clustering factor is then set to $H_a / R$.

Since the vast majority of documents contain text strings which are parallel to the document axes, it is advantageous to remove these strings first. To allow for orientation errors during digitization, a tolerance of five degrees is allowed in processing horizontal and vertical lines. Strings at other orientations are processed after considering the horizontal and vertical strings. Furthermore, more populated cells in the Hough domain are processed before processing less populated cells. This order of string processing gives greatest weight to longer strings. If shorter phrases or words were treated before longer ones, a degradation in performance would be observed.

In summary, extraction of strings from the Hough domain is performed as follows:

1) Calculate the average height of components $H_{ws}$ in the working set.
2) Set the Hough domain resolution $R$ to $0.2 \times H_{ws}$. Set a counter to zero.
3) Apply the Hough transform to all components in the working set for $\theta$ in the range: $0^\circ \le \theta \le 5^\circ$, $85^\circ \le \theta \le 95^\circ$, $175^\circ \le \theta \le 180^\circ$. (In our implementation $\rho$ is allowed to have negative values and $\theta_{max} = 180$.) Set the running threshold $RT_c = 20$.
4) For each cell having a count greater than $RT_c$, perform steps 5-10.
5) Form a cluster of 11 $\rho$ cells (constant $\theta$), including the primary cell, centered around the primary cell.
6) Compute the average height of components in the cluster, $H_a$.
7) Compute the new clustering factor $f_{clus} = H_a / R$.
8) Re-cluster $\pm f_{clus}$ cells centered around the primary cell, including the primary cell.
9) Perform string segmentation (explained in detail in a later section).
10) Update the Hough transform by deleting the contributions from all components discarded in step 9.
11) Decrement $RT_c$ by one. If $RT_c$ is greater than 2, go to step 4.
12) If the counter is equal to 1 then stop. Otherwise compute the Hough transform for the remaining components for $\theta$ in the range: $0^\circ \le \theta < 180^\circ$. Reset $RT_c$ to 20. Increment the counter by one. Go to step 4.

Step 10 above is performed in order to "refresh" the Hough domain. If the Hough array is not continually updated, discarded characters contribute to cell counts despite having been removed. This can lead to a severe degradation in processing speed since information about discarded components is unnecessarily extracted from the Hough domain. Thus, the refreshing of the Hough domain is designed to improve performance.

Logical Grouping of Strings into Words and Phrases

Fig. 3 shows the connected components associated with the line $l_{pr}$. Note that the centroids of these components do not necessarily lie on the line $l_{pr}$. It is necessary to order these components according to their distance along the line [from a reference point, say $(0, y_0)$]. That is, for each component the distance along the line $l_{pr}$ is calculated.

In order to locate words and phrases, the positional relationships between connected components are examined. It is the intercharacter gap and interword gap thresholds, $T_c$ and $T_w$, which are used to determine component grouping. These parameters are functions of the character height. In computing this, an average of the height of several local characters should be used rather than simply using the height of the component currently under examination. For example, the component under examination may be a period at the end of a string, which would have very small height. For this reason the four neighboring (two on each side) characters are used, along with the current component, to determine the thresholds $T_c$ and $T_w$. These operations are explained in the following paragraphs.
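The ordering and gap comparison described for logical grouping can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The text here says only that the two thresholds are functions of the local character height, so the multipliers `k_char` and `k_word` are hypothetical placeholders, as is the `(cx, cy, w, h)` component tuple. The reading that a gap up to the intercharacter threshold stays inside a word, and a gap up to the interword threshold stays inside a phrase, is also an assumption. The gap formula is an edge-to-edge gap only for strings parallel to a document axis.

```python
import math


def along_line(component, theta_deg):
    """Project a (cx, cy, w, h) component's centroid onto the line direction.

    theta_deg is the angle of the line's normal, as in rho = x cos(theta) + y sin(theta).
    A constant reference-point offset is omitted; it does not change the ordering.
    """
    th = math.radians(theta_deg)
    return component[0] * math.sin(th) - component[1] * math.cos(th)


def local_height(ordered, i):
    """Average height of component i and up to two neighbours on each side."""
    window = ordered[max(0, i - 2): i + 3]
    return sum(c[3] for c in window) / len(window)


def segment_string(components, theta_deg, k_char=1.0, k_word=2.5):
    """Group collinear components into phrases, each a list of words."""
    ordered = sorted(components, key=lambda c: along_line(c, theta_deg))
    if not ordered:
        return []
    phrases = [[[ordered[0]]]]
    for i in range(1, len(ordered)):
        prev, cur = ordered[i - 1], ordered[i]
        centre_dist = along_line(cur, theta_deg) - along_line(prev, theta_deg)
        gap = centre_dist - (prev[2] + cur[2]) / 2  # simplification: uses box widths
        h = local_height(ordered, i)
        t_c, t_w = k_char * h, k_word * h  # hypothetical threshold rules
        if gap > t_w:
            phrases.append([[cur]])        # gap too wide: start a new phrase
        elif gap > t_c:
            phrases[-1].append([cur])      # new word in the same phrase
        else:
            phrases[-1][-1].append(cur)    # same word
    return phrases


# Five characters on a horizontal line (normal at 90 degrees), given out of order.
chars = [(30, 50, 6, 10), (3, 50, 6, 10), (100, 50, 6, 10), (11, 50, 6, 10), (38, 50, 6, 10)]
phrases = segment_string(chars, 90)
word_xs = [[[c[0] for c in word] for word in phrase] for phrase in phrases]
```

With these toy values the first four characters form two words of one phrase and the isolated fifth character forms its own phrase; different multipliers would shift those boundaries.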
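The area-threshold rule and the Hough voting on component centroids described earlier can be sketched as follows. This is a hedged illustration, not the authors' code. The function names, the use of unit-wide histogram bins for finding the most populated area, and the `(cx, cy, w, h)` tuples are assumptions. The factor of five, the resolution of 0.2 times the average working-set height, the one-degree angular step, and the axis-aligned angle ranges are taken from the text.

```python
import math
from collections import Counter, defaultdict

# Angles within five degrees of the document axes (step 3 of the summary).
AXIS_THETAS = list(range(0, 6)) + list(range(85, 96)) + list(range(175, 181))


def area_threshold(areas, factor=5):
    """Return factor * max(most populated area, average area)."""
    a_mp = Counter(areas).most_common(1)[0][0]  # assumes unit-wide histogram bins
    a_avg = sum(areas) / len(areas)
    return factor * max(a_mp, a_avg)


def hough_resolution(heights, scale=0.2):
    """Return R = 0.2 * (average height of the working set)."""
    return scale * sum(heights) / len(heights)


def hough_votes(centroids, resolution, thetas_deg=AXIS_THETAS):
    """Map each (rho_index, theta_deg) cell to the indices of centroids voting for it."""
    cells = defaultdict(list)
    for i, (x, y) in enumerate(centroids):
        for t in thetas_deg:
            th = math.radians(t)
            rho = x * math.cos(th) + y * math.sin(th)  # equation (1); may be negative
            cells[(math.floor(rho / resolution), t)].append(i)
    return cells


# Area filtering: one very large component among small ones.
areas = [12] * 8 + [24, 5000]
threshold = area_threshold(areas)
working_areas = [a for a in areas if a <= threshold]

# Hough voting: three roughly collinear centroids and one outlier.
boxes = [(10, 11, 6, 10), (20, 11, 6, 10), (30, 11.4, 6, 10), (15, 60, 6, 10)]
R = hough_resolution([h for _, _, _, h in boxes])
cells = hough_votes([(x, y) for x, y, _, _ in boxes], R)
horizontal_string = cells[(5, 90)]
```

In this toy set the 5000-unit component exceeds the threshold and is dropped, while the three centroids near y = 11 share one cell at 90 degrees despite the small vertical offset of the third. Neighbouring angles may hold the same group, which is one reason the paper processes the most populated cells first and then refreshes the array.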
914 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 10, NO. 6, NOVEMBER 1988
[Fig. 6. Test image 2.]
[5] H. Bunke, "Automatic interpretation of lines and text in circuit diagrams," in Pattern Recognition Theory and Applications, J. Kittler, K. S. Fu, and L. F. Pau, Eds. Boston, MA: D. Reidel, 1982, pp. 297-310.
[6] L. T. Watson, K. Arvind, A. W. Ehrich, and R. M. Haralick, "Extraction of lines and regions from grey tone line drawing images," Pattern Recognition, vol. 17, pp. 493-506, 1984.
[7] W. H. Chen, W. K. Pratt, E. R. Hamilton, R. H. Wallis, and P. J. Capitant, "Combined symbol matching facsimile data compression system," Proc. IEEE, vol. 68, pp. 786-796, 1980.
[8] F. M. Wahl, M. K. Y. Wong, and R. G. Casey, "Block segmentation and text extraction in mixed text/image documents," Comput. Vision, Graphics, Image Processing, vol. 20, pp. 375-390, 1982.
[9] H. Bley, "Segmentation and preprocessing of electrical schematics using picture graphs," Comput. Vision, Graphics, Image Processing, vol. 28, pp. 271-288, 1984.
[10] L. A. Fletcher, "Text string separation from mixed text/graphics images," M.S. thesis, Dep. Elec. Eng., Pennsylvania State Univ., Aug. 1986.
[11] A. Rosenfeld and A. C. Kak, Digital Picture Processing, vol. 2, 2nd ed. New York: Academic, 1982.
[12] J. P. Foith, C. Eisenbarth, E. Enderle, H. Geisselmann, H. Ringshauser, and G. Zimmermann, "Real-time processing of binary images for industrial applications," in Digital Image Processing Systems, L. Bolc and Z. Kulpa, Eds. Berlin: Springer-Verlag, 1981.
[13] W. K. Pratt, Digital Image Processing. New York: Wiley, 1978, pp. 523-525.

Lloyd Alan Fletcher was born in Birmingham, England, in 1962. He received the B.S. degree in physics with microelectronics and computing from the University of Leicester, England, in 1984. From August 1984 until August 1986 he attended the Pennsylvania State University, where he was a teaching and research assistant, receiving the M.S. degree in electrical engineering in 1986. At Penn State his research interests included computer vision and digital image analysis.
Since September 1986 he has been a Member of Technical Staff with Bell Communications Research, Red Bank, NJ. He is currently working in the Switching Analysis and Reliability Technology Center, where he is involved with the development of criteria to assure the reliability and quality of digital switching systems and telecommunications equipment deployed by the regional Bell telephone companies.

Rangachar Kasturi (M'82) was born in Bangalore, India, in 1949. He received the B.E. degree in electrical engineering from Bangalore University in 1968 and the M.S.E.E. and Ph.D. degrees from Texas Tech University, Lubbock, in 1980 and 1982, respectively.
Dr. Kasturi is an Associate Professor of Electrical Engineering at the Pennsylvania State University, where he was an Assistant Professor from 1982 to 1986. From 1978 to 1982 he was a Research Assistant at Texas Tech University and was engaged in research in multiplex holography and digital image processing. From 1976 to 1978 he was the Engineering Officer at the Visvesvaraya Industrial and Technological Museum, Bangalore, and from 1969 to 1976 he was with the Bharat Electronics Ltd., Bangalore, as a Research and Development Engineer. His current research interests are in the applications of image analysis and artificial intelligence techniques for map data processing, graphics recognition, and document analysis. He has directed several projects sponsored by NSF, Digital Equipment Corporation, AT&T, and the Applied Research Laboratory. He has published a number of papers in journals and conference proceedings. He is the author of a book chapter and is coeditor of a book on image analysis applications to be published by Marcel Dekker.
Dr. Kasturi is a member of OSA, SPIE, Eta Kappa Nu, and Sigma Xi.