3 M&a
By
Maryam Nabeel
Mohammed Ali
Ali Mohammed
Mustafa Dilshad
Supervisor
Mr. Farooq Safaaldin
BSc
April 2024
Kirkuk, Iraq
Acknowledgements
We would like to express our deepest appreciation to all those who made it
possible to complete this research. We owe special gratitude to our project
supervisor, Mr. Farooq Safaaldin, whose stimulating suggestions and
encouragement helped us to coordinate our project, especially in writing
this research.
Special thanks go to our team members, who worked together and helped each
other to assemble the parts and offered suggestions about the procedures of
this project.
We are also grateful to Kirkuk Technical College for providing the laboratory
facilities.
Finally, we wish to thank our parents for their support and encouragement
throughout our study.
Abstract
Table of Contents
Acknowledgements
Abstract
Table of Contents
Chapter 1 Introduction
4.1
5.1 Back end
Database
MySQL
6.1 Recommendations
6.2 Conclusions
References
Chapter 1 Introduction
Poor-quality images with low resolution, blurred text, or skew can make it
difficult for OCR software to recognize characters correctly.
Character Recognition Errors: OCR technology can sometimes misinterpret
characters, leading to errors in the output. For example, similar-looking
characters such as 'l' and '1' or 'O' and '0' are easily confused [1,8,10].
Formatting Issues: OCR software can struggle with layout features such as
columns, tables, and font styles, which make it difficult to recognize and
convert text accurately.
Language and Character Set Support: OCR software may not support all
languages and character sets, which can result in inaccurate recognition of
characters or an inability to recognize them at all.
Handwriting Recognition: OCR technology struggles with handwritten text.
Even with the best OCR software, it is challenging to recognize handwriting
with high accuracy.
Noise and Distortion: Noise and distortion in the source image, such as
smudges, stains, and creases, can cause OCR technology to misinterpret
characters or fail to recognize them at all.
Overall, OCR technology has made significant advancements in recent years,
but there are still several challenges that need to be addressed to improve its
accuracy and reliability [4].
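The character confusions mentioned above ('l' versus '1', 'O' versus '0') can sometimes be reduced with a simple post-correction pass. The following is a minimal sketch and not the method used in this project: the mapping table and the "mostly digits" rule are assumptions chosen for illustration only.

```python
import re

# Hypothetical mapping of letters that OCR engines often confuse with digits.
LETTER_TO_DIGIT = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0"})


def fix_numeric_tokens(text: str) -> str:
    """Replace look-alike letters with digits inside tokens that are mostly digits."""

    def repair(match: re.Match) -> str:
        token = match.group(0)
        digit_count = sum(ch.isdigit() for ch in token)
        # Assumption: a token is "numeric" when more than half of it is digits.
        if digit_count * 2 > len(token):
            return token.translate(LETTER_TO_DIGIT)
        return token

    # Visit every alphanumeric token and repair only the numeric-looking ones.
    return re.sub(r"[0-9A-Za-z]+", repair, text)


print(fix_numeric_tokens("Invoice 2O24 total l00 for Oliver"))
```

Ordinary words such as "Oliver" are left untouched because they contain no digits. A heuristic like this cannot resolve every case (for example a token made of one letter and one digit), so it is best treated as one optional clean-up step.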
1.2 Thesis Organisation
This survey of Optical Character Recognition (OCR) using Python provides an
overview of the various Python libraries and packages available for OCR, as
well as the current state of the art in OCR using Python. One of the most
widely used OCR engines in Python is Tesseract, an open-source OCR engine
originally developed at HP and later maintained with sponsorship from Google.
Tesseract provides a high level of accuracy and supports a variety of
languages and scripts, making it a popular choice for OCR applications. The
Python binding for Tesseract, pytesseract, provides a simple interface for
integrating Tesseract into Python applications. Another library widely used
in Python OCR pipelines is OpenCV, an open-source computer vision library.
OpenCV provides a range of image processing and computer vision algorithms,
including object detection and segmentation, which can be used to improve the
accuracy of OCR. The integration of OpenCV with Tesseract or pytesseract
provides a powerful tool for OCR applications. Other OCR libraries in Python
include OCRopus, a Python-based OCR system whose development was sponsored by
Google, and pyOCR, a Python wrapper for the Tesseract OCR engine. These
libraries provide alternative options for OCR implementation in Python and
offer different levels of functionality and accuracy. In recent years, there
has been growing interest in the use of deep learning algorithms for OCR.
Python provides a few deep learning libraries, such as TensorFlow and
PyTorch, which can be used to build OCR systems. These libraries allow for
the training of deep learning models for OCR and can be used to improve the
accuracy and efficiency of OCR systems. Overall, the literature survey
highlights the versatility of Python for OCR implementation, with a range of
libraries and packages available for OCR, including Tesseract and OpenCV,
as well as deep learning libraries such as TensorFlow and PyTorch. Python
provides a simple and flexible platform for OCR implementation, making it
an attractive option for OCR applications in a variety of domains.
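As an illustration of how OpenCV and pytesseract can be combined, the sketch below converts an image to greyscale, binarises it with Otsu's method, and passes it to Tesseract. It assumes that the Tesseract engine, `opencv-python`, and `pytesseract` are installed. The helper names and the chosen page segmentation mode are assumptions made for this example, not settings taken from the project.

```python
def build_tesseract_config(psm: int = 6, oem: int = 3) -> str:
    """Build the option string passed to the Tesseract command line.

    --psm 6 treats the image as a single uniform block of text;
    --oem 3 lets Tesseract choose its default engine mode.
    """
    return f"--oem {oem} --psm {psm}"


def image_to_text(image_path: str, lang: str = "eng") -> str:
    """Preprocess an image with OpenCV and extract its text with Tesseract."""
    # Imported here so the helper above can be used without these packages.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)

    # Greyscale conversion followed by Otsu thresholding gives a clean
    # black-and-white image, which usually helps recognition.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return pytesseract.image_to_string(
        binary, lang=lang, config=build_tesseract_config()
    )


# Example call (the file name is hypothetical):
# text = image_to_text("sample.png")
```

Other preprocessing steps, such as deskewing or noise removal, could be added before the thresholding step depending on the quality of the input.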
Overall, the main aim and objectives of OCR technology are to automate the
recognition and conversion of text-based data, improving the efficiency,
accuracy, and accessibility of text-based information. OCR technology has
made significant advancements in recent years, but there are still several
challenges that need to be addressed to improve its accuracy and reliability.
[1] Optical Character Recognition: Tesseract is an open-source OCR engine. It
was initially developed at HP between 1984 and 1994. In 1995, it was sent to
UNLV for the Annual Test of Optical Character Recognition Accuracy, after a
joint project between HP Labs Bristol and HP's Scanner Division in Colorado.
Finally, in 2005, HP released Tesseract as open source, and it is available
from the Tesseract OCR website.
[2] Natural Language Processing with Python: analyzing text with the Natural
Language Toolkit.
[3] Information extraction and text summarization using linguistic knowledge
acquisition: The lack of extensive linguistic coverage is the major barrier to
extracting useful information from large bodies of text. Current natural
language processing (NLP) systems do not have lexicons rich enough to cover
all the important words and phrases in extended texts, which can draw on
essentially the whole of the spoken language.
Chapter 3
Programming languages
Objective
The objective of this experiment was to develop and test a software system
capable of converting handwritten text and audio inputs into digital text
format. The system integrates Optical Character Recognition (OCR) and
voice recognition technologies to process images and sound files,
respectively.
Methodology
OCR Component
- The preprocessed images were then fed into the OCR engine to extract text.
- The system was designed to accept voice inputs in various forms, including
human speech, robot-generated voices, and other sounds.
Experiment Execution
- A diverse dataset of images and audio files was collected to test the
system's capabilities.
- The OCR component was tested with the image dataset, while the voice
recognition component was tested with the audio files.
- A library was used to measure the accuracy of the text conversion by
comparing the system's output with a predefined ground truth.
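One common way to compare recognised text with a ground truth is character-level accuracy based on edit distance. The sketch below is an assumption about how such a measurement could be made and is not the library used in the experiment; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(output: str, truth: str) -> float:
    """Return 1 - (edit distance / ground-truth length), clipped at 0."""
    if not truth:
        return 1.0 if not output else 0.0
    return max(0.0, 1.0 - edit_distance(output, truth) / len(truth))


# One wrong character out of five gives an accuracy of about 0.8.
print(round(character_accuracy("he1lo", "hello"), 2))
```

The same function can be applied to the transcripts produced by the voice recognition component, since both components ultimately output text.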
Results
- The voice recognition component accurately converted Y% of the audio inputs
into text, demonstrating its effectiveness on a range of sound types.
- The combined OCR and voice recognition system showed robust performance,
with an overall accuracy of Z% in converting both images and audio into
digital text.
Discussion
This section would analyze the results, discussing the success rate of the
system, its limitations, and potential areas for improvement.
other technologies such as Natural Language Processing (NLP) to enhance
text comprehension and contextual analysis. The findings underscore the
potential of OCR in various sectors, suggesting that future developments
focus on increasing recognition precision, expanding language support, and
ensuring accessibility. The research concludes with recommendations for
future work, emphasizing the need for continuous innovation to meet the
growing demands of digitization in an increasingly data-driven world.
Chapter 5
5.1 Back end
The main idea of this project came from a problem faced by many users when
copying the text content of published images: without a suitable tool, the
text must be copied manually. OCR (Optical Character Recognition) software
is used to read characters from images. A screenshot of the text region can
be taken, and the characters in it can be converted to editable text with
the help of OCR software. This can be implemented as an upgrade of existing
media players.
having to retype it.
In this project, we show some Python libraries that make it possible to
extract text from images quickly and with little effort. The explanation of
the libraries is followed by a practical example. The dataset used is taken
from Kaggle. To simplify the concepts, we use a single image from the film
Rush. The most important Python libraries used are:
1. pytesseract: A Python wrapper for Google's Tesseract-OCR Engine. It
allows for the extraction of text from images.
2. tkinter: The standard Python interface to the Tk GUI toolkit. It is used for
creating graphical user interfaces.
4. fpdf: A library that allows for the creation of PDF files with Python.
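The libraries listed above could be combined roughly as follows: tkinter lets the user pick an image, pytesseract extracts the text, and fpdf writes the text to a PDF file. This is a minimal sketch under stated assumptions, not the project's actual code. The function names and the default file name are hypothetical, and the Latin-1 clean-up step assumes that one of the built-in PDF fonts is used, since these fonts cannot encode arbitrary Unicode characters.

```python
def to_latin1(text: str) -> str:
    """Replace characters outside Latin-1 with '?' so that the built-in
    PDF fonts can render the text."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


def choose_image() -> str:
    """Open a file dialog and return the selected path ('' if cancelled)."""
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window, show only the dialog
    path = filedialog.askopenfilename(
        title="Select an image",
        filetypes=[("Images", "*.png *.jpg *.jpeg")],
    )
    root.destroy()
    return path


def save_text_as_pdf(text: str, pdf_path: str) -> None:
    """Write the text to a single-column PDF using a built-in font."""
    from fpdf import FPDF

    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", size=12)
    pdf.multi_cell(0, 8, to_latin1(text))  # width 0 = up to the right margin
    pdf.output(pdf_path)


def image_to_pdf(pdf_path: str = "output.pdf") -> None:
    """Pick an image, run OCR on it, and save the result as a PDF."""
    import pytesseract

    image_path = choose_image()
    if image_path:
        save_text_as_pdf(pytesseract.image_to_string(image_path), pdf_path)
```

Calling `image_to_pdf()` opens the dialog and, if an image is chosen, writes the recognised text to `output.pdf` in the current directory.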
commercial products.
However, there is a silver lining: thanks to advances in AI, this task can
now be streamlined with code. With AI-based OCR algorithms, image-based text
can be translated efficiently and accurately into accessible, actionable and
searchable data.
This chapter focuses on various types of images and the corresponding methods
required to extract text from them. We highlight the limitations of some
common approaches and offer practical solutions to enhance output. So, why
is it necessary to translate images to text?
The necessity for extracting text isn't just restricted to invoices. Other
important use cases include the digital conversion of recruitment forms,
resumes, healthcare records, food labels, ID document scans, and location-
specific images such as store names and street signs.
Images with a simple setup (large text, few words, simple fonts, and clear
contrast between text and background) may require only a few lines of code.
More complex images with mixed fonts, noisy backgrounds, shadowed or skewed
text, or handwriting are likely to prove more challenging. Such images call
for extra coding effort in a do-it-yourself program: they need preprocessing
before extraction and further editing afterwards to correct the extracted
text.
For straightforward images, the following methods are suitable.
Tesseract and OpenCV: Tesseract is a well-regarded open-source OCR engine for
extracting text from images. Its counterpart, the Open Source Computer Vision
Library (OpenCV), is a computer vision and machine learning software library
that offers a variety of algorithms for working with videos and images. By
pairing Tesseract and OpenCV, users can extract data from images using
Python. After Tesseract is installed on the system, the pytesseract library,
a Python wrapper, should be installed along with OpenCV. A few simple steps
then convert the text in an image into a string using Tesseract.
OnlineOCR: Another alternative for converting images to text is the online
service OnlineOCR.
easyOCR: Fairly efficient and user-friendly, easyOCR is a Python library with
a simple interface for extracting text from basic images. A brief command
initializes the reader. The readtext method then returns a list of text
detection results, each containing the bounding box coordinates, the
extracted text, and a confidence score. These results can then be processed
further or printed.
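The structure of the easyOCR results can be made concrete with a short sketch. It assumes the `easyocr` package is installed. The confidence threshold of 0.5 and the helper names are assumptions for illustration, and the sample results used below are made up to show the data shape rather than taken from a real run.

```python
def confident_text(results, min_confidence: float = 0.5) -> list:
    """Keep the text of detections whose confidence reaches the threshold.

    Each result is expected to be (bounding_box, text, confidence), which is
    the shape returned by easyOCR's readtext method.
    """
    return [
        text
        for _bounding_box, text, confidence in results
        if confidence >= min_confidence
    ]


def read_image(image_path: str, languages=("en",)):
    """Run easyOCR on an image and return its raw detection results."""
    import easyocr  # imported here so confident_text works without it

    # Creating the reader loads the models (downloaded on first use).
    reader = easyocr.Reader(list(languages))
    return reader.readtext(image_path)


# Made-up results in the same shape, to show how the filter behaves.
sample = [
    ([[10, 10], [120, 10], [120, 40], [10, 40]], "RUSH", 0.98),
    ([[15, 60], [70, 60], [70, 80], [15, 80]], "x9~", 0.12),
]
print(confident_text(sample))
```

Filtering on the confidence score is a simple way to discard detections that are probably noise before the text is stored or displayed.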
Besides pytesseract and easyOCR, other Python libraries with OCR
capabilities are available for extracting text from images. They provide a
uniform interface to the underlying OCR engines. Libraries such as PyOCR and
OCRopus provide additional choice and flexibility for OCR in Python. Some
libraries can even be used for both single-page and multi-page document OCR.
While they work well on basic images, open-source Python libraries may fall
short on complex images. They produce inaccurate results if the background is
pixelated, blurry or similar in color to the text, or if the image is a
handwritten or scanned copy. They perform poorly if the image contains
multiple columns or irregular text placement. Also, they are not equipped
with natural language processing (NLP) features to check and improve output.
If the input deviates from these standard conditions, the Python libraries
output incorrect results.
Chapter 6 Conclusions and Recommendations
6.1 Recommendations
Looking ahead, the future of OCR is poised for further innovation. The
integration of OCR with technologies like Natural Language Processing
(NLP) and Machine Learning (ML) will enhance its accuracy and efficiency.
Additionally, the development of OCR for less commonly used scripts and
languages will open new avenues for global information exchange. It is also
recommended that future research focuses on improving OCR's capability to
interpret handwritten texts and complex layouts, making it more versatile and
user-friendly.
6.2 Conclusions
Optical Character Recognition has been around for many years and has become
increasingly important as the amount of digital information has grown. The
future of OCR development using Python looks promising, as Python is a
popular and widely used programming language for various applications,
including OCR. Here are a few areas where OCR development using Python is
expected to grow in the future:
Improved Accuracy: With advancements in deep learning and computer vision,
OCR algorithms will continue to improve their accuracy in recognizing text in
images and PDFs, leading to even better performance.
Real-Time OCR: As the demand for real-time processing of images and videos
increases, OCR systems will need to adapt to real-time processing. Python's
ability to handle real-time data processing makes it a suitable choice for
developing real-time OCR systems.
Multilingual OCR: As the world becomes more connected and globalized, the
demand for OCR systems that can handle multiple languages will continue to
grow. Python has strong support for processing multiple languages, which
makes it a suitable platform for multilingual OCR development.
Handwriting Recognition: With the increasing use of digital devices for
note-taking, the demand for OCR systems that can recognize handwritten text
will continue to grow.
programming language make it an ideal choice for developing these kinds of
applications.
crowdsourcing. In general, ongoing OCR research strives to increase OCR
speed, accuracy, and adaptability as well as make OCR available for a wider
variety of applications and languages. OCR technology is expected to become
more crucial as tasks related to digitalization, automation, and data analysis
progress.
References
Processing and Management, Volume 25, Issue 4, pp. 419-428.
[4] S. B. Kotsiantis, D. Kanellopoulos and P. E. Pintelas, "Data
Preprocessing for Supervised Leaning", International Journal of Computer
Science, Volume 1, Number 1, 2006.
[5] Jonathan Webster, Chunyu Kit, "Tokenization as the Initial Phase in NLP",
City Polytechnic of Hong Kong, in Proceedings of the 14th Conference on
Computational Linguistics, Vol. 4, pp. 1106-1110.
[6] A. Mitthal, P. Kumarguru, "Optical Character Recognition Tool", IIIT D.
S. Vijayarani, J. Ilamathi, Nithya, "Preprocessing Techniques for Text
Mining - An Overview", International Journal of Computer Science &
Communication Networks, Vol. 5(1), pp. 7-16.
[7] David Meyer, Kurt Hornik, Ingo Feinerer, "Text Mining Infrastructure in
R", Journal of Statistical Software, 25(5), 2008, pp. 1-54.
[8] Steven Bird, Edward Loper, "NLTK: The Natural Language Toolkit", in
Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions,
Article No. 31.
[9] R. Smith, "An Overview of the Tesseract OCR Engine", Proc. 9th Int.
Conf. on Document Analysis and Recognition, IEEE, Curitiba, Brazil, Sep 2007,
pp. 629-633.
[10] The Tesseract open source OCR engine, http://code.google.com/p/tesseract-ocr.
[11] R. W. Smith, "The Extraction and Recognition of Text from Multimedia
Document Images", PhD Thesis, University of Bristol, November 1987.
[12] WingSoon Wilson Lian, "Heuristic-Based OCR Post-Correction for Smart
Phone Applications", Honors Thesis, Department of Computer Science, University
of North Carolina at Chapel Hill, 2009.
[13] Sonia Bhaskar, Nicholas Lavassar, Scott Green, "Implementing Optical
Character Recognition on the Android Operating System for Business Cards",
EE 368 Digital Image Processing.