OCR on a Grid Infrastructure: Project Synopsis
ON
OCR ON A GRID INFRASTRUCTURE
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE AWARD OF DEGREE OF
Bachelor of Technology
In
Information Technology
Axis Institute of Technology & Management, Kanpur
Submitted to:                Submitted by:
Mr. Adesh Chandra            Brajesh Kumar (1171913009)
(Assistant Professor)        Aman Bhathiya (11719130)
                             Om Prakash Bharti (11719130)
                             Pooja Yaday (11719130)
                             Aditi Sharma (11719130)
INTRODUCTION
There is a growing demand for software systems that can recognize characters when information
is scanned from paper documents, since a large number of newspapers and books on many subjects
exist only in printed form. There is also a strong demand for storing the information in these
paper documents on computer storage and reusing it later through search. One simple way to store
this information is to scan the documents and keep them as IMAGES. Reusing it is difficult,
however, because the contents of an image cannot be read or searched line by line and word by
word. A scanned page is stored as pixels rather than as character codes, and the font
characteristics of printed characters differ from those of the fonts used by the computer. As a
result, the computer cannot recognize the characters while reading them. Storing the contents of
paper documents on a computer and then reading and searching that content is called DOCUMENT
PROCESSING. Sometimes document processing must also handle languages other than English. It
requires a software system called a CHARACTER RECOGNITION SYSTEM, and the process is also called
DOCUMENT IMAGE ANALYSIS (DIA).
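To illustrate why a scanned image is not directly searchable, the following minimal sketch shows what the computer actually holds after scanning: a grid of pixels that must first be reduced to an ink/background bitmap before any character can be recognized. The class name `Binarizer`, the method `toBitmap`, and the fixed threshold of 128 are assumptions made for this example and are not part of the project design.

```java
import java.awt.image.BufferedImage;

// Illustrative sketch only: reduces a scanned page to a black/white bitmap.
class Binarizer {

    // Hypothetical fixed threshold; real scans usually need an adaptive one.
    static final int THRESHOLD = 128;

    /** Returns true for dark ("ink") pixels, indexed as [row][column]. */
    static boolean[][] toBitmap(BufferedImage page) {
        int height = page.getHeight();
        int width = page.getWidth();
        boolean[][] ink = new boolean[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = page.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int gray = (r + g + b) / 3; // simple average of the channels
                ink[y][x] = gray < THRESHOLD;
            }
        }
        return ink;
    }

    public static void main(String[] args) {
        // A tiny synthetic "page": all white except one black pixel.
        BufferedImage page = new BufferedImage(3, 2, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 2; y++) {
            for (int x = 0; x < 3; x++) {
                page.setRGB(x, y, 0xFFFFFF);
            }
        }
        page.setRGB(1, 0, 0x000000);
        boolean[][] ink = toBitmap(page);
        System.out.println("ink at row 0, column 1: " + ink[0][1]);
        System.out.println("ink at row 1, column 2: " + ink[1][2]);
    }
}
```

The bitmap contains no letters or words, only pixel values, which is why a recognition step is needed before the content can be searched.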
Our aim is therefore to develop a character recognition software system that performs Document
Image Analysis, transforming documents from paper format to electronic format. Several
techniques exist for this process; among them we have chosen Optical Character Recognition (OCR)
as the fundamental technique for recognizing characters. Converting paper documents to
electronic format is an ongoing task in many organizations, particularly in Research and
Development (R&D), large business enterprises, and government institutions. Our problem
statement also motivates the use of OCR in mobile electronic devices such as cell phones and
digital cameras, which acquire images and recognize them as part of face recognition and
validation.
To use OCR effectively for character recognition in Document Image Analysis (DIA), we organize
the information in a grid format. The system is thus effective and useful in the design and
construction of Virtual Digital Libraries.
OBJECTIVE
The main purpose of an Optical Character Recognition (OCR) system based on a grid
infrastructure is to perform Document Image Analysis, that is, the processing of electronic
documents converted from paper, more effectively and efficiently. It aims to improve the
accuracy of character recognition during document processing compared with existing character
recognition methods. Here the OCR technique derives the identity of the characters and their
font properties from their bit-mapped images.
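As a rough illustration of how a character can be derived from its bit-mapped image, the sketch below compares a glyph bitmap against stored templates and picks the one with the fewest differing pixels. Template matching by pixel difference is an assumption chosen for the example, not necessarily the recognition method of this project; the class `TemplateMatcher` and its 3x3 templates are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: nearest-template matching on small binary bitmaps.
class TemplateMatcher {

    private final Map<Character, boolean[][]> templates = new LinkedHashMap<>();

    /** Registers a template drawn with '#' for ink and any other character for background. */
    void addTemplate(char label, String... rows) {
        templates.put(label, parse(rows));
    }

    static boolean[][] parse(String... rows) {
        boolean[][] bits = new boolean[rows.length][];
        for (int y = 0; y < rows.length; y++) {
            bits[y] = new boolean[rows[y].length()];
            for (int x = 0; x < rows[y].length(); x++) {
                bits[y][x] = rows[y].charAt(x) == '#';
            }
        }
        return bits;
    }

    /** Counts differing pixels; assumes both bitmaps have the same dimensions. */
    static int distance(boolean[][] a, boolean[][] b) {
        int diff = 0;
        for (int y = 0; y < a.length; y++) {
            for (int x = 0; x < a[y].length; x++) {
                if (a[y][x] != b[y][x]) {
                    diff++;
                }
            }
        }
        return diff;
    }

    /** Returns the label of the closest template, or '?' if none is registered. */
    char recognize(boolean[][] glyph) {
        char best = '?';
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<Character, boolean[][]> entry : templates.entrySet()) {
            int d = distance(glyph, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TemplateMatcher matcher = new TemplateMatcher();
        matcher.addTemplate('I', ".#.", ".#.", ".#.");
        matcher.addTemplate('L', "#..", "#..", "###");
        matcher.addTemplate('T', "###", ".#.", ".#.");
        // An 'L' with one missing pixel, as might come from a noisy scan.
        boolean[][] noisy = parse("#..", "#..", "##.");
        System.out.println(matcher.recognize(noisy)); // prints L
    }
}
```

A practical recognizer would also normalize glyph size and handle many fonts per character, which this sketch leaves out.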
The primary objective is to speed up character recognition in document processing, so that the
system can process a large number of documents in less time.
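The speed-up objective can be pictured as spreading pages across several workers so that they are recognized concurrently. The sketch below uses a local thread pool as a stand-in for grid nodes, which is an assumption for illustration only; a real grid would dispatch work to separate machines. The names `GridOcrSketch`, `recognizePage`, and `recognizeAll` are hypothetical, and `recognizePage` is a placeholder rather than a real recognizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: a thread pool standing in for grid worker nodes.
class GridOcrSketch {

    // Placeholder for the real recognition step on one page.
    static String recognizePage(String pageName) {
        return "text of " + pageName;
    }

    /** Recognizes all pages concurrently and returns results in page order. */
    static List<String> recognizeAll(List<String> pages, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String page : pages) {
                Callable<String> task = () -> recognizePage(page);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // waits for that page to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for pages", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("page recognition failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("page-1", "page-2", "page-3", "page-4");
        for (String text : recognizeAll(pages, 2)) {
            System.out.println(text);
        }
    }
}
```

Collecting the futures in submission order keeps the output aligned with the page order even when workers finish at different times.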
Because our character recognition is based on a grid infrastructure, it aims to recognize
heterogeneous characters that belong to different languages, with different font properties and
alignments.
ABSTRACT
Our aim is to develop a character recognition software system that performs Document Image
Analysis, transforming documents from paper format to electronic format. Several techniques
exist for this process; among them we have chosen Optical Character Recognition (OCR) as the
fundamental technique for recognizing characters. Converting paper documents to electronic
format is an ongoing task in many organizations, particularly in Research and Development
(R&D), large business enterprises, and government institutions. Our problem statement also
motivates the use of OCR in mobile electronic devices such as cell phones and digital cameras,
which acquire images and recognize them as part of face recognition and validation.
To use OCR effectively for character recognition in Document Image Analysis (DIA), we organize
the information in a grid format. The system is thus effective and useful in the design and
construction of Virtual Digital Libraries.
There is a growing demand for software systems that can recognize characters when information
is scanned from paper documents, since a large number of newspapers and books on many subjects
exist only in printed form. There is also a strong demand for storing the information in these
paper documents on computer storage and reusing it later through search. One simple way to store
this information is to scan the documents and keep them as IMAGES. Reusing it is difficult,
however, because the contents of an image cannot be read or searched line by line and word by
word. The font characteristics of the printed characters differ from those of the fonts used by
the computer, so the computer cannot recognize the characters while reading them.
OCR derives the identity of characters and their font properties from their bit-mapped images.
Because the information is handled in a grid format, the system applies grid technologies to
character recognition.
SCOPE OF PROJECT
The scope of our product, Optical Character Recognition on a grid infrastructure, is to provide
an efficient and enhanced software tool with which users can perform Document Image Analysis and
document processing by reading and recognizing characters. It targets research, academic,
governmental, and business organizations that hold large pools of scanned document images.
Irrespective of the size of the documents and the type of characters in them, the product
recognizes, searches, and processes them quickly according to the needs of the environment.
EXISTING SYSTEM
There is a growing demand from users to convert printed documents into electronic documents in
order to maintain the security of their data. The basic OCR system was invented to convert the
data available on paper into computer-processable documents, so that the documents can be edited
and reused. The existing system, the predecessor of OCR on a grid infrastructure, is plain OCR
without grid functionality. That is, the existing system handles only homogeneous character
recognition, or character recognition for a single language.
TECHNICAL REQUIREMENTS
2.1 SOFTWARE REQUIREMENTS SPECIFICATION
Operating System     : Windows XP
Programming Language : Core Java
User Interface       : Swing
2.2 HARDWARE REQUIREMENTS SPECIFICATION
Processor            : Pentium IV processor or higher
RAM                  : Minimum of 512 MB RAM
Memory               : 500 MB or higher
PROPOSED METHODOLOGY
The architecture of the optical character recognition system on a grid infrastructure consists
of three main components:
1. Scanner
2. OCR hardware or software
3. Output interface
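The three components listed above can be sketched as a simple pipeline in which the scanner produces a bitmap, the OCR stage turns it into text, and the output interface delivers the result. The interface names `PageScanner`, `OcrEngine`, and `OutputInterface`, and the stub implementations in `main`, are assumptions made for this illustration; the synopsis does not define the actual interfaces.

```java
// Illustrative sketch only: the three architectural components as a pipeline.
class OcrPipelineSketch {

    /** Component 1: acquires a page as a bitmap (true = ink). */
    interface PageScanner {
        boolean[][] scan(String documentId);
    }

    /** Component 2: OCR hardware or software that turns a bitmap into text. */
    interface OcrEngine {
        String recognize(boolean[][] bitmap);
    }

    /** Component 3: delivers the recognized text to the user or to storage. */
    interface OutputInterface {
        void publish(String documentId, String text);
    }

    /** Runs one document through scanner, OCR engine, and output in order. */
    static void process(String documentId, PageScanner scanner,
                        OcrEngine engine, OutputInterface output) {
        boolean[][] bitmap = scanner.scan(documentId);
        String text = engine.recognize(bitmap);
        output.publish(documentId, text);
    }

    public static void main(String[] args) {
        // Stub components; real ones would wrap a scanner driver and a recognizer.
        PageScanner scanner = id -> new boolean[][] {{true, false}, {true, true}};
        OcrEngine engine = bitmap -> "bitmap with " + bitmap.length + " rows";
        OutputInterface output = (id, text) -> System.out.println(id + ": " + text);
        process("doc-1", scanner, engine, output);
    }
}
```

Keeping the components behind separate interfaces would let the OCR stage be replaced, for instance by a grid-backed implementation, without changing the scanner or the output side.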
BENEFIT OF PROPOSED SYSTEM
The proposed system overcomes the drawback of the existing system by supporting multiple
functions such as editing and searching. It also adds heterogeneous character recognition.
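Once pages have been recognized as text, the searching benefit mentioned above becomes straightforward. The sketch below builds a small word index over recognized text so that documents can be looked up by word. The class `TextIndexSketch` and its methods are hypothetical, and the splitting rule (any run of characters that are not Unicode letters or digits) is an assumption chosen so that words from different languages are kept.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: a word index over recognized document text.
class TextIndexSketch {

    // Maps each lower-cased word to the sorted IDs of documents containing it.
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Adds the recognized text of one document to the index. */
    void addDocument(String documentId, String recognizedText) {
        String[] words = recognizedText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            index.computeIfAbsent(word, key -> new TreeSet<>()).add(documentId);
        }
    }

    /** Returns the IDs of documents containing the word, ignoring case. */
    Set<String> search(String word) {
        Set<String> hits = index.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<String>emptySet() : hits;
    }

    public static void main(String[] args) {
        TextIndexSketch idx = new TextIndexSketch();
        idx.addDocument("doc-1", "Optical character recognition on a grid.");
        idx.addDocument("doc-2", "Grid computing for document processing.");
        System.out.println(idx.search("grid"));
        System.out.println(idx.search("optical"));
    }
}
```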
TIME FRAME REQUIRED FOR VARIOUS STAGES OF PROJECT IMPLEMENTATION
Sr. No.   PHASES            TIME DURATION
1.        Synopsis          ---- week
2.        System Design     ---- week
3.        Coding            ---- week
4.        Implementation    ---- week
5.        Testing           ---- week