OPTICAL CHARACTER RECOGNITION (OCR)
Contents
Introduction
Stages in OCR
MATLAB Implementation
Steps in MATLAB Implementation
Android Implementation
Advantages
Applications
Conclusion
References
INTRODUCTION
Motivation: Text detection and recognition have many applications in automatic indexing and information retrieval, such as document indexing, content-based image retrieval, and car license plate recognition, which in turn open up the possibility of more advanced systems.
OCR: OCR is the mechanical or electronic translation of images of handwritten, typewritten, or printed text (usually captured by a scanner) into machine-editable text.
Aims and Objectives
Segmentation: Separate the text region into its individual characters.
Recognition: Recognize each character in the detected text region using a suitable algorithm.
The goal of Optical Character Recognition (OCR) is to classify optical patterns (often contained in a digital image) corresponding to alphanumeric or other characters.
STAGES IN OCR
TRAINING: Pre-processing -> Feature Extraction -> Model Estimation
TESTING: Pre-processing -> Feature Extraction -> Classification
PRE-PROCESSING
The raw data is subjected to a number of preliminary processing steps to make it usable in the descriptive stages of character analysis. Pre-processing aims to produce data on which OCR systems can operate accurately. The main objectives of pre-processing are:
Binarization
Noise reduction
Stroke width normalization
Skew correction
Slant removal
BINARIZATION
Binarization (thresholding) refers to the conversion of a gray-scale image into a binary image. Two categories of thresholding are:
Global: picks one threshold value for the entire document image, often based on an estimate of the background level from the intensity histogram of the image.
Adaptive (local): uses a different threshold value for each pixel according to the local area information.
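The two thresholding categories can be illustrated with a minimal Python sketch (Python is used here only for illustration; the deck's own implementation is in MATLAB). The function names are hypothetical, the global threshold defaults to the mean intensity as a simple stand-in for a histogram-based background estimate, and the adaptive version compares each pixel with the mean of its local window. Images are assumed to be lists of rows of gray values.

```python
def global_threshold(gray, t=None):
    """Binarize with one threshold for the whole image.

    If no threshold is given, the mean intensity is used (an assumption,
    standing in for a histogram-based estimate of the background level).
    """
    if t is None:
        pixels = [p for row in gray for p in row]
        t = sum(pixels) / len(pixels)
    return [[1 if p > t else 0 for p in row] for row in gray]


def adaptive_threshold(gray, radius=1, offset=0):
    """Binarize each pixel against the mean of its local neighbourhood."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Window is clipped at the image border.
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            local_mean = sum(window) / len(window)
            out[y][x] = 1 if gray[y][x] > local_mean - offset else 0
    return out
```

The global version is cheaper, while the local version tends to cope better with uneven illumination because each pixel is judged against its own surroundings.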
Noise Reduction and Normalization
Noise reduction improves the quality of the document. Normalization provides a tremendous reduction in data size, and thinning extracts the shape information of the characters. Two main approaches:
Filtering (masks)
Morphological operations (erosion, dilation, etc.)
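Both approaches can be sketched in a few lines of Python (an illustration only, not the deck's MATLAB code). The function names are hypothetical, the masks are fixed 3 x 3 windows clipped at the image border, and the input is assumed to be a binary image stored as a list of rows.

```python
def median_filter(img, radius=1):
    """Filtering: replace each pixel by the median of its neighbourhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = sorted(img[j][i]
                            for j in range(max(0, y - radius), min(h, y + radius + 1))
                            for i in range(max(0, x - radius), min(w, x + radius + 1)))
            out[y][x] = window[len(window) // 2]
    return out


def _morph(img, combine):
    """Apply combine (min for erosion, max for dilation) over each 3 x 3 window."""
    h, w = len(img), len(img[0])
    return [[combine(img[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2)))
             for x in range(w)]
            for y in range(h)]


def erode(img):
    return _morph(img, min)


def dilate(img):
    return _morph(img, max)


def opening(img):
    """Erosion followed by dilation: removes specks smaller than the mask."""
    return dilate(erode(img))
```

Opening is chosen here as a typical speckle remover; closing (dilation followed by erosion) would instead fill small gaps in broken strokes.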
6/10/13
FEATURE EXTRACTION
In the feature extraction stage, each character is represented as a feature vector, which becomes its identity. The major goal of feature extraction is to extract a set of features that maximizes the recognition rate with the least number of elements. Because handwriting has a high degree of variability and imprecision, obtaining these features is a difficult task.
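As one possible illustration of turning a character into a short feature vector, the Python sketch below uses zoning: the binary character image is divided into a grid and the fraction of "on" pixels in each zone becomes one feature. Zoning is an assumption chosen for brevity, not a method named in the slides, and the function name is hypothetical.

```python
def zoning_features(img, rows=2, cols=2):
    """Return the on-pixel density of each zone in a rows x cols grid."""
    h, w = len(img), len(img[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            # Integer zone boundaries; zones may differ by one pixel in size.
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cells = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(sum(cells) / len(cells) if cells else 0.0)
    return feats
```

A 2 x 2 grid gives only four elements per character; a finer grid keeps more shape detail at the cost of a longer vector, which is the trade-off described above.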
MODEL ESTIMATION
Given labelled sets of features for many characters, where the labels correspond to the particular classes that the characters belong to, we wish to estimate a statistical model for each character class.
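A minimal Python sketch of this step, assuming the simplest statistical model: each class is summarized by the per-feature mean and variance of its training vectors. The slides do not say which model family is used, so this choice and the function name are assumptions.

```python
from collections import defaultdict


def estimate_models(samples):
    """Estimate a (mean, variance) model per class.

    samples: iterable of (label, feature_vector) pairs.
    Returns a dict mapping each label to (mean_vector, variance_vector).
    """
    grouped = defaultdict(list)
    for label, feats in samples:
        grouped[label].append(feats)

    models = {}
    for label, vectors in grouped.items():
        n = len(vectors)
        columns = list(zip(*vectors))  # one tuple per feature dimension
        mean = [sum(col) / n for col in columns]
        var = [sum((v - m) ** 2 for v in col) / n
               for col, m in zip(columns, mean)]
        models[label] = (mean, var)
    return models
```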
CLASSIFICATION
According to Tou and Gonzalez, "The principal function of a pattern recognition system is to yield decisions concerning the class membership of the patterns with which it is confronted." In the context of an OCR system, the recognizer is confronted with a sequence of feature patterns from which it must determine the character classes.
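The decision step can be sketched in Python as a nearest-mean classifier: each incoming feature pattern is assigned to the class whose mean vector is closest. This is only one simple decision rule, chosen here as an assumption; the class means and the function names are hypothetical.

```python
def classify(features, class_means):
    """Return the label whose mean vector is nearest to the feature pattern."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(class_means, key=lambda label: sq_dist(features, class_means[label]))


def classify_sequence(patterns, class_means):
    """Classify a sequence of feature patterns, one decision per pattern."""
    return [classify(p, class_means) for p in patterns]
```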
MATLAB IMPLEMENTATION
Flowchart: Preprocess -> Segmentation -> Recognition
Snapshot of MATLAB Application
Make Template
To create template.mat for use in classification:
36 images of characters, size = 60 X 55
Matrix size 55 X 60 X 36
Saved as template.mat
Preprocess
Raw Image -> Noise Filter -> Binarize -> Resizing -> Bounding -> Complementing -> Preprocessed Image
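The last three preprocessing steps can be sketched in Python on a binary image stored as a list of rows (an illustration only, not the MATLAB code behind the slides). The function names are hypothetical, resizing uses nearest-neighbour sampling as an assumption, and the order in which the steps are chained is up to the caller.

```python
def complement(img):
    """Swap foreground and background (0 <-> 1)."""
    return [[1 - p for p in row] for row in img]


def crop_to_bounding_box(img):
    """Keep only the smallest rectangle that contains every on pixel."""
    rows = [y for y, row in enumerate(img) if any(row)]
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    if not rows:
        return []  # no on pixels at all
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def resize_nearest(img, new_h, new_w):
    """Resize by nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]
```

Complementing matters when the scan has dark text on a light background, since the later steps treat 1 as the "on" (character) pixel.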
Segmentation - Connected Components
Character segmentation involves the following steps:
Scan the image from left to right to find an "on" pixel.
When an on pixel is found, extract all on pixels connected to it and segment them as one character.
Repeat the process until the right edge of the image is reached.
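The scan described above can be sketched in Python as a column-by-column search with a breadth-first flood fill (an illustration, not the MATLAB implementation). The function name is hypothetical, and 8-connectivity is an assumption, since the slides do not say which connectivity is used.

```python
from collections import deque


def segment_characters(img):
    """Return connected components in left-to-right order.

    Each component is a sorted list of (row, col) positions of on pixels.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for x in range(w):                      # scan columns left to right
        for y in range(h):
            if img[y][x] and not seen[y][x]:
                # New on pixel found: collect everything connected to it.
                queue = deque([(y, x)])
                seen[y][x] = True
                pixels = []
                while queue:
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                components.append(sorted(pixels))
    return components
```

Because components are emitted in the order the scan first meets them, the result follows the reading order of a single text line; characters made of separate parts (such as "i") would come out as more than one component.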
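Corr2
corr2 computes the two-dimensional correlation coefficient r between two matrices i and j of the same size:
$$ r = \frac{\sum_m \sum_n (i_{mn} - \bar{i})(j_{mn} - \bar{j})}{\sqrt{\left(\sum_m \sum_n (i_{mn} - \bar{i})^2\right)\left(\sum_m \sum_n (j_{mn} - \bar{j})^2\right)}} $$
where $\bar{i}$ is the mean of the input matrix i and $\bar{j}$ is the mean of the input matrix j. In general $-1 \le r \le 1$: r = 1 means i and j match exactly (up to brightness and contrast), while r near 0 means i and j are not alike at all.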
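For readers without MATLAB, the same two-dimensional correlation coefficient can be written out in plain Python. This is a sketch that mirrors the standard definition rather than MathWorks' own code; matrices are assumed to be equal-sized lists of rows, and a constant input yields NaN because the denominator is zero.

```python
import math


def corr2(a, b):
    """Two-dimensional correlation coefficient of two equal-sized matrices."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("matrices must have the same number of elements")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in flat_a)
                    * sum((y - mean_b) ** 2 for y in flat_b))
    return num / den if den else float("nan")
```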
Recognition - Template Correlations
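temp = templates(:,:,j);         % j-th character template
in = chars(:,:,i);               % i-th segmented character
allCorrs(j) = corr2(temp, in);   % correlation between template j and character i

The template with the highest correlation gives the recognized character.
Example values of allCorrs(j) for one source image (the source and template images from the slide are not reproduced here):
0.82011
0.57395
0.43850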
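The final decision over the correlation values can be sketched in Python as follows. The function name, the optional rejection threshold min_corr, and the labels "X", "Y", "Z" in the usage lines are hypothetical placeholders; the slides do not say which characters the three example values belong to.

```python
def pick_best_template(all_corrs, labels, min_corr=0.0):
    """Return (label, score) for the highest correlation.

    Returns None when even the best score is below min_corr.
    """
    if len(all_corrs) != len(labels):
        raise ValueError("need exactly one label per correlation value")
    best = max(range(len(all_corrs)), key=lambda j: all_corrs[j])
    if all_corrs[best] < min_corr:
        return None
    return labels[best], all_corrs[best]


# Usage with the example values above and placeholder labels.
result = pick_best_template([0.82011, 0.57395, 0.43850], ["X", "Y", "Z"])
```

A rejection threshold is a design choice worth considering: without it, every segmented blob, including noise, is forced onto some character.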
Android Implementation
We built the same OCR application for Android devices, named MyOCR, using the open-source Tesseract library (development sponsored by Google).
Tesseract Background:
Developed on HP-UX at HP between 1985 and 1994 to run in a desktop scanner.
Came neck and neck with Caere and XIS in the 1995 UNLV test.
Never used in an HP product.
Open sourced in 2005.
Now on: http://code.google.com/p/tesseract-ocr
Highly portable.
Tesseract OCR Architecture
ADVANTAGES
Increases efficiency
Recovers valuable space
Eliminates the need for retyping
Greater accessibility
APPLICATIONS
Document reading machines used for banking applications
Automatic address reading for mail sorting
Data entry
Process automation
Aid for the blind
Automatic number-plate readers
Other Applications
Text Entry
Page readers for text entry, mainly used in Office Automation
Typical errors in OCR
Variations in shape
Due to serifs and style variations.
Deformations
Caused by broken characters, smudged characters and speckle.
Variations in spacing
Due to subscripts, superscripts, skew and variable spacing
Mixture of text and graphics
Future needs
The need for constrained OCR will decrease
Omni-font OCR systems
Recognition of manually produced documents
Recognition of entire words instead of individual characters
REFERENCES
http://www.uri.edu/~hansenj/projects/ele585/OCR/
J.T. Tou and R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley Publishing Company, Inc., Reading, Massachusetts, 1974
M. Szmurlo, Master's Thesis, Oslo, May 1995 (users.info.unicaen.fr/~szmurlo/papers/masters/master.thesis.ps.gz)
THANK YOU
Special Thanks To: Google.com, Mathworks.com