
Optical Character Recognition (OCR)

By
Maryam Nabeel
Mohammed Ali
Ali Mohammed
Mustafa Dilshad

Supervisor
Mr. Farooq Safaaldin

A thesis submitted in partial fulfilment of the requirements for the degree of

Computer Engineering Techniques

Computer Engineering Techniques Department

Technical Engineering College

Northern Technical University

BSc

April 2024

Kirkuk, Iraq
Acknowledgements

We would like to express our deepest appreciation to all those who made it
possible to complete this research. We owe special gratitude to our
project supervisor, Mr. Farooq Safaaldin, whose stimulating suggestions and
encouragement helped us to coordinate our project, especially in writing
this research.

Furthermore, we would like to acknowledge with much appreciation the
crucial role of the staff of the Computer Engineering Department, who gave us
permission to use all the required equipment and the necessary materials to
conduct the research.

Special thanks go to our team members, who worked together, helped each other
to assemble the parts, and offered suggestions about the procedures of this
project.

We are also grateful to Kirkuk Technical College for providing the laboratory
facilities.

Finally, we wish to thank our parents for their support and encouragement
throughout our study.

Abstract

Optical Character Recognition (OCR) technology has revolutionized the way
we interact with textual data, enabling the digitization of documents from
various mediums such as scanned paper, PDFs, or images captured by digital
cameras into editable and searchable formats [14].

The recent surge in OCR accuracy can be attributed to the advent of
sophisticated deep learning models, which have been meticulously trained on
expansive and diverse datasets to perform exceptionally well even in
challenging conditions with complex layouts and background noise [15].

These state-of-the-art models have undergone rigorous benchmarking
through a series of tests, demonstrating their robustness and versatility in
recognizing an array of text styles across different backgrounds [16].

By incorporating OCR into computational workflows, we unlock new
horizons for data analysis and breathe new life into historical documents,
transforming them from static images into dynamic, analyzable datasets that
can be processed and examined computationally [17].

Table of Contents

Acknowledgements
Abstract
Table of Contents
Chapter 1 Introduction
1.1 Problem Statement
1.2 Thesis Organisation
Chapter 2 Literature Review
Chapter 3 Programming languages
Chapter 4 Experimental Work
4.1
4.2 front end
Chapter 5 back end
5.1 back end
Database
MySQL
5.2 database and XAMPP
Chapter 6 Conclusions and Recommendations
6.1 Recommendations
6.2 Conclusions
References

Chapter 1 Introduction

1. History and Evolution of OCR


Optical Character Recognition (OCR) technology has a long and interesting
history. The first OCR systems were developed in the early 1900s, but they
were limited in their ability to recognize text accurately. In the 1960s, OCR
technology began to evolve rapidly with the advent of computers, leading to
the development of more advanced OCR systems. During the 1970s, OCR
systems became more sophisticated, incorporating features such as document
layout analysis and font recognition. In the 1980s, OCR technology continued
to advance, with the development of more advanced algorithms for character
recognition, as well as the introduction of OCR software for personal
computers. The 1990s saw the rise of digital imaging and the widespread
adoption of OCR technology in a variety of industries, including government,
finance, and healthcare [1-2].
In recent years, OCR technology has advanced significantly with the advent
of deep learning and machine learning algorithms. Today, OCR systems can
recognize text in a variety of languages and scripts and are used in a wide
range of applications, from document scanning and archiving to data
extraction and information retrieval. The development of OCR technology is
expected to further improve its accuracy and efficiency, making it an
increasingly important tool for businesses and individuals alike [2-6].
Some recent developments in the field of OCR include improved text
recognition in complex scenes, handwritten text recognition, integration
with augmented reality, and integration with the Internet of Things (IoT).
The remainder of this work is organized as follows. Section 2 covers related
work on the history of optical character recognition and its presence in the
Python programming language. Section 3 presents the methodology adopted,
with the driving code for the OCR process. Section 4 reviews the mechanism of
the OCR engine along with the future scope of OCR development using Python.
Section 5 concludes with directions for future research.

1.1 Problem Statement


OCR (Optical Character Recognition) is a technology that recognizes text
within digital images or scanned documents and converts that text into
machine-readable characters. While OCR technology has made significant
advances in recent years, several common problems are still associated with
its use:

- Quality of the Source Image: The quality of the source image is critical
for accurate OCR. Poor-quality images with low resolution, blurred text, or
skew make it difficult for OCR software to recognize characters correctly.

- Character Recognition Errors: OCR technology can sometimes misinterpret
characters, leading to errors in the output. For example, similar-looking
characters such as 'l' and '1' or 'O' and '0' are easily confused [1,8,10].

- Formatting Issues: OCR software can struggle with formatting such as
columns, tables, and font styles, which makes it difficult to recognize and
convert text accurately.

- Language and Character Set Support: OCR software may not support all
languages and character sets, which can result in inaccurate recognition of
characters or an inability to recognize them at all.

- Handwriting Recognition: OCR technology struggles with handwritten text.
Even with the best OCR software, it is challenging to recognize handwriting
with high accuracy.

- Noise and Distortion: Noise and distortion in the source image, such as
smudges, stains, and creases, can cause OCR technology to misinterpret
characters or fail to recognize them at all.

Overall, OCR technology has made significant advances in recent years,
but several challenges still need to be addressed to improve its
accuracy and reliability [4].
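
The confusion between look-alike characters mentioned above is often reduced with a simple post-processing pass over the OCR output. The following is a minimal sketch and not part of the system described in this thesis. It assumes that a token made up mostly of digits should contain only digits, and that a token made up mostly of letters should contain no digits. The function names `fix_token` and `fix_confusables` and the two mapping tables are illustrative choices.

```python
# Illustrative post-correction for look-alike characters (an assumption,
# not the method used in this project).

# Characters that OCR engines commonly confuse, in both directions.
TO_DIGIT = {"l": "1", "I": "1", "O": "0", "o": "0"}
TO_LETTER = {"1": "l", "0": "O"}


def fix_token(token: str) -> str:
    """Repair one token using a majority vote between digits and letters."""
    digits = sum(ch.isdigit() for ch in token)
    letters = sum(ch.isalpha() for ch in token)
    if digits > letters:
        # Mostly digits: treat stray look-alike letters as digits.
        return "".join(TO_DIGIT.get(ch, ch) for ch in token)
    if letters > digits:
        # Mostly letters: treat stray look-alike digits as letters.
        return "".join(TO_LETTER.get(ch, ch) for ch in token)
    return token  # Tie: leave the token untouched.


def fix_confusables(text: str) -> str:
    """Apply fix_token to every space-separated token in a line of OCR output."""
    return " ".join(fix_token(tok) for tok in text.split(" "))


if __name__ == "__main__":
    print(fix_confusables("Invoice 2O24 tota1 1O5"))
```

The majority vote is a deliberately crude heuristic: it cannot repair mixed tokens such as product codes, and it always restores '0' as an upper-case 'O'. A dictionary or language model would be needed for anything more reliable.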
1.2 Thesis Organisation
Overall, the main aim and objectives of OCR technology are to automate the
recognition and conversion of text-based data, improving the efficiency,
accuracy, and accessibility of text-based information. OCR technology has
made significant advancements in recent years, but there are still several
challenges that need to be addressed to improve its accuracy and reliability.

Chapter 2 Literature Review


This chapter gives an overview of the Python libraries and packages available
for OCR, as well as the current state of the art in OCR using Python. One of
the most widely used OCR engines in Python projects is Tesseract, an
open-source engine originally developed at HP and later maintained with
sponsorship from Google. Tesseract provides a high level of accuracy and
supports a variety of languages and scripts, making it a popular choice for
OCR applications. The Python binding for Tesseract, pytesseract, provides a
simple interface for integrating Tesseract into Python applications. Another
library widely used in OCR work is OpenCV, an open-source computer vision
library. OpenCV provides a range of image processing and computer vision
algorithms, including object detection and segmentation, which can be used to
improve the accuracy of OCR.

The integration of OpenCV with Tesseract or pytesseract provides a powerful
tool for OCR applications. Other OCR libraries in Python include OCRopus,
a Python-based OCR system whose development was sponsored by Google, and
pyOCR, a Python wrapper for the Tesseract OCR engine [9-14]. These libraries
provide alternative options for OCR implementation in Python and offer
different levels of functionality and accuracy. In recent years, there has
been growing interest in the use of deep learning algorithms for OCR.
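
As an illustration of the OpenCV and pytesseract combination described above, the sketch below reads an image with OpenCV and passes it to Tesseract. It is a minimal example under stated assumptions rather than the code of this project: the function names `build_config` and `extract_text` are hypothetical, the page segmentation mode is an arbitrary choice, and the Tesseract binary together with the `pytesseract` and `opencv-python` packages must be installed for `extract_text` to run.

```python
# A minimal sketch of calling Tesseract through pytesseract.
# Requires the Tesseract binary plus the pytesseract and opencv-python packages.


def build_config(psm: int = 6, oem: int = 3) -> str:
    """Build the command-line options passed through to Tesseract.

    --psm selects the page segmentation mode (6 = a single uniform block
    of text) and --oem the OCR engine mode (3 = default engine).
    """
    return f"--oem {oem} --psm {psm}"


def extract_text(image_path: str, lang: str = "eng") -> str:
    """Read an image with OpenCV and return the text Tesseract finds."""
    # Imported here so that the module still loads where the OCR stack
    # is not installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"could not read image: {image_path}")
    # OpenCV loads images in BGR order; convert to RGB before OCR.
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(rgb, lang=lang, config=build_config())
```

A call such as `extract_text("scan.png")` (the file name is a placeholder) would return the recognized text as a single string.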

Python provides a number of deep learning libraries, such as TensorFlow and
PyTorch, which can be used to build OCR systems. These libraries allow for
the training of deep learning models for OCR and can be used to improve the
accuracy and efficiency of OCR systems. Overall, the literature survey
highlights the versatility of Python for OCR implementation, with a range of
libraries and packages available for OCR, including Tesseract and OpenCV,
as well as deep learning libraries such as TensorFlow and PyTorch. Python
provides a simple and flexible platform for OCR implementation, making it an
attractive option for OCR applications in a variety of domains.

[1] Optical Character Recognition: Tesseract is an open-source OCR engine. It
was initially developed between 1984 and 1994 at HP. In 1995, it was sent to
UNLV for the Annual Test of Optical Character Recognition Accuracy, after a
joint project between HP Labs Bristol and HP's Scanner Division in Colorado.
Finally, in 2005, Tesseract was released as open source by HP and made
available on the Tesseract OCR website.

[2] Natural Language Processing with Python: Analyzing text with the Natural
Language Toolkit.

[3] Information extraction and text summarization using linguistic knowledge
acquisition: The lack of extensive linguistic coverage is the major barrier to
extracting useful information from large bodies of text. Current natural
language processing (NLP) systems do not have rich enough lexicons to cover
all the important words and phrases in extended texts, which span essentially
the whole of the language.

Chapter 3
Programming languages

Chapter 4 Experimental Work

Objective

The objective of this experiment was to develop and test a software system
capable of converting handwritten text and audio inputs into digital text
format. The system integrates Optical Character Recognition (OCR) and
voice recognition technologies to process images and sound files,
respectively.

Methodology

OCR Component

- Pytesseract: An OCR engine was utilized to interpret text from images.
The pytesseract library, which is a Python wrapper for Google's Tesseract-OCR
Engine, was employed.

- Images were preprocessed using OpenCV to enhance text recognition accuracy.
This involved converting images to grayscale, applying dilation and erosion,
and saving the preprocessed images.

- The preprocessed images were then fed into the OCR engine to extract text.
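
To make the dilation and erosion steps concrete, the sketch below applies both to a tiny binary image stored as plain Python lists. It illustrates what the two morphological operations compute with a 3x3 square window and is not the OpenCV code used in the project. In OpenCV the corresponding calls are `cv2.dilate` and `cv2.erode`. The helper names are hypothetical.

```python
# Illustrative 3x3 morphological operations on a binary image
# (1 = foreground/ink, 0 = background). Pixels outside the image are ignored.


def _neighbourhood(img, r, c):
    """Yield the values in the 3x3 window centred on (r, c)."""
    rows, cols = len(img), len(img[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                yield img[rr][cc]


def dilate(img):
    """A pixel becomes 1 if any pixel in its window is 1 (strokes grow)."""
    return [[max(_neighbourhood(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]


def erode(img):
    """A pixel stays 1 only if every pixel in its window is 1 (strokes shrink)."""
    return [[min(_neighbourhood(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]


if __name__ == "__main__":
    # A single speck of noise in an otherwise empty 5x5 image.
    speck = [[0] * 5 for _ in range(5)]
    speck[2][2] = 1
    cleaned = dilate(erode(speck))  # erosion removes the isolated speck
    print(cleaned == [[0] * 5 for _ in range(5)])
```

Eroding and then dilating (an "opening") removes specks smaller than the window while restoring the thickness of larger strokes. Applying the two in the opposite order (a "closing") instead fills small gaps inside characters.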

Voice Recognition Component

- The system was designed to accept voice inputs in various forms, including
human speech, robot-generated voices, and other sounds.

- A voice recognition module was implemented to convert the audio input into
text. The specifics of the voice recognition technology used (e.g., a
specific API or custom model) would be detailed here.

Experiment Execution

- A diverse dataset of images and audio files was collected to test the
system's capabilities.

- The OCR component was tested with the image dataset, while the voice
recognition component was tested with the audio files.

- A text-comparison routine from a Python library (presumably difflib, which
is listed among the libraries in Chapter 5) was used to measure the accuracy
of the text conversion by comparing the system's output with a predefined
ground truth.
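
The accuracy measurement described above can be sketched with Python's standard `difflib` module. This is a minimal example that assumes `difflib.SequenceMatcher` as the comparison routine, since the exact call is not shown in the text. The function name `text_accuracy` and the sample strings are illustrative.

```python
from difflib import SequenceMatcher


def text_accuracy(ocr_output: str, ground_truth: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    # autojunk=False disables the heuristic that ignores very frequent
    # characters in long inputs, which would distort the score.
    matcher = SequenceMatcher(None, ocr_output, ground_truth, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    truth = "recognition"
    guess = "recogniti0n"  # one character misread as a digit
    print(f"{text_accuracy(guess, truth):.1%}")
```

The ratio is twice the number of matching characters divided by the combined length of both strings, so it rewards partial matches. For reporting, a character error rate based on edit distance is a common alternative.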

Results

- The system achieved an accuracy of X% on the image dataset, successfully
recognizing handwritten text across various styles and quality levels.

- The voice recognition component accurately converted Y% of the audio inputs
into text, demonstrating its effectiveness on a range of sound types.

- The combined OCR and voice recognition system showed robust performance,
with an overall accuracy of Z% in converting both images and audio into
digital text.

Discussion

This section would analyze the results, discussing the success rate of the
system, its limitations, and potential areas for improvement.

This research delves into the advancements of Optical Character Recognition
(OCR) technology and its transformative impact on digital information
processing. OCR has emerged as a pivotal tool in converting printed and
handwritten texts into machine-encoded text, enabling efficient data retrieval
and analysis. The study explores the evolution of OCR from basic character
recognition to its current state, where it incorporates artificial intelligence to
interpret text across diverse languages and formats. The research highlights
the challenges faced in recognizing cursive handwriting and non-standard
fonts, and how machine learning algorithms have significantly improved
accuracy rates. Furthermore, the paper discusses the integration of OCR with
other technologies such as Natural Language Processing (NLP) to enhance
text comprehension and contextual analysis. The findings underscore the
potential of OCR in various sectors, suggesting that future developments
focus on increasing recognition precision, expanding language support, and
ensuring accessibility. The research concludes with recommendations for
future work, emphasizing the need for continuous innovation to meet the
growing demands of digitization in an increasingly data-driven world.

Chapter 5
5.1 back end
The main idea of this project came from a problem faced by many users:
copying the text content of published images. Without OCR (Optical Character
Recognition) software, which reads characters from images, the text must be
copied manually. With OCR software, a screenshot can be taken of the text
section of an image and the characters converted to editable text. This can
be implemented as an upgrade of existing media players.

Optical character recognition (OCR) is a technology that recognizes text in
images, such as scanned documents and photos. Perhaps you’ve taken a photo
of a text just because you didn’t want to take notes or because taking a photo
is faster than typing it. Fortunately, thanks to smartphones today, we can
apply OCR so that we can copy the picture of text we took before without
having to retype it.

Python OCR (Python Optical Character Recognition) is the technology used
here: it recognizes and extracts text from images such as scanned documents
and photos using Python. It can be implemented with the open-source OCR
engine Tesseract.

Optical Character Recognition is an old but still challenging problem that
involves the detection and recognition of text from unstructured data,
including images and PDF documents. It has useful applications in banking,
e-commerce, and content moderation in social media. But as with every
topic in data science, there is a huge amount of resources to sift through
when trying to learn how to solve the OCR task. This chapter is therefore
written as a tutorial to help the reader get started.

In this project, we show some Python libraries that make it possible to
extract text from images quickly and without much effort. The
explanation of the libraries is followed by a practical example. The dataset
used is taken from Kaggle. To simplify the concepts, we use a single image
of the film Rush. The most important Python libraries used are:
1. pytesseract: A Python wrapper for Google's Tesseract-OCR Engine. It
allows for the extraction of text from images.

2. tkinter: The standard Python interface to the Tk GUI toolkit. It is used for
creating graphical user interfaces.

3. filedialog: A module in tkinter that provides classes and factory functions
for creating file/directory selection windows.

4. fpdf: A library that allows for the creation of PDF files with Python.

5. cv2 (OpenCV): An open-source computer vision and machine
learning software library. It provides a common infrastructure for computer
vision applications and accelerates the use of machine perception in
commercial products.

6. numpy: A fundamental package for scientific computing with Python. It
provides support for large, multi-dimensional arrays and matrices, along with
a collection of mathematical functions to operate on these arrays.

7. PIL (Pillow): The Python Imaging Library adds image processing
capabilities to your Python interpreter. Pillow is the friendly PIL fork and an
easy-to-use library developed for opening, manipulating, and saving many
different image file formats.

8. difflib: A module that provides classes and functions for comparing
sequences and for producing the differences in HTML, context, and unified
diff formats.
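
To show how several of the libraries listed above might fit together, the following sketch selects an image through a file dialog, extracts its text, and saves the text as a PDF. It is a hedged illustration and not the application code of this project. The function names are hypothetical, the font and line height are arbitrary choices, and the third-party imports are deferred so the helper `to_latin1` can be used on its own.

```python
# Sketch of a select-image -> OCR -> PDF flow with the libraries listed above.
# Function names are illustrative; third-party imports are deferred so the
# module loads even where they are not installed.


def to_latin1(text: str) -> str:
    """Replace characters that cannot be encoded as Latin-1 with '?'."""
    return text.encode("latin-1", "replace").decode("latin-1")


def choose_image() -> str:
    """Open a file dialog and return the selected path ('' if cancelled)."""
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window; only the dialog is shown
    path = filedialog.askopenfilename(
        title="Select an image",
        filetypes=[("Images", "*.png *.jpg *.jpeg *.bmp *.tif")],
    )
    root.destroy()
    return path


def image_to_pdf(image_path: str, pdf_path: str) -> str:
    """OCR an image, write the recognised text to a PDF, and return the text."""
    import pytesseract
    from fpdf import FPDF
    from PIL import Image

    text = pytesseract.image_to_string(Image.open(image_path))
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", size=12)
    # The built-in PDF fonts only cover Latin-1, so other characters
    # are replaced before writing.
    pdf.multi_cell(0, 8, to_latin1(text))
    pdf.output(pdf_path)
    return text
```

A typical use would be `image_to_pdf(choose_image(), "output.pdf")`, where the output file name is a placeholder. Text in scripts outside Latin-1 would need a Unicode font registered with the PDF library instead of the replacement step.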

5.2 How to Convert Image to Text Using Python

Leveraging artificial intelligence (AI) and optical character recognition
(OCR), it is possible to draw out text from an array of file formats. This
extraction can be further simplified with code. This section describes the
method of translating images to textual data using the Python programming
language.

Organizations in the modern era are bombarded with a significant amount of
unstructured data in a myriad of formats: PDFs, scanned files, images, and
the like. Manual extraction of crucial textual information from these heaps of
data is a taxing task bound to result in errors and inefficiencies.

However, there's a silver lining to this – thanks to strides made in AI, it's now
possible to streamline this task with code. Throw AI-fueled OCR algorithms
into the mix and one can efficiently and accurately translate image-based text
into accessible, actionable and searchable data.

This section focuses on various types of images and the corresponding methods
required to extract text from them. We highlight the limitations of some
common approaches and offer practical solutions to enhance output. So, why
is it necessary to translate images to text?

5.3 Why is Text Extraction Important?

Numerous entities churn out image data from operational documentation.
Sadly, the text in these images is hard to view, edit, or analyse because it
is not searchable. It is therefore imperative to extract or translate it into
string data so that it can be captured and used.

Consider extracting invoice details such as dates, supplier information,
amounts, and other textual information from invoice images: such data can be
stored for auditing, for tax purposes, or to assess supplier performance.

The necessity for extracting text isn't just restricted to invoices. Other
important use cases include the digital conversion of recruitment forms,
resumes, healthcare records, food labels, ID document scans, and location-
specific images such as store names and street signs.

5.4 What Kinds of Images are Suitable for Text Extraction?

In Python, text extraction is theoretically applicable to all types of
images. However, depending on the expected outputs, the complexity of the
code and the accuracy may differ greatly.

Images with a simple setup, featuring large text, few words, simple fonts,
and clear contrast between text and background, may require only a few
lines of code. More complex images with different fonts, noisy backgrounds,
shadowed or skewed text, or handwritten text will likely prove more
challenging. Such images call for extra coding effort in a do-it-yourself
program. They demand preprocessing of the image prior to extraction and
further editing afterwards to correct the extracted text.

5.5 Translating Simple Images to Textual Data in Python

For straightforward images, the following methods are suitable.

Tesseract and OpenCV: Tesseract is a well-regarded open-source OCR engine
that extracts text from images accurately. The Open Source Computer Vision
Library (OpenCV) is a computer vision and machine learning software library
that offers a variety of options and algorithms for working with videos and
images. By pairing Tesseract and OpenCV, users can extract data from images
using Python. After Tesseract is installed on the system, the pytesseract
library (a Python wrapper) and OpenCV should be installed. A few simple steps
then translate the text image into a string using Tesseract.

OnlineOCR: Another alternative for converting images to text is the online
service OnlineOCR.

EasyOCR: Fairly efficient and user-friendly, EasyOCR is a Python library
with a simple interface for extracting text from basic images. A brief
command initializes text extraction. The readtext method then returns a list
of text detection results, each containing the extracted text, bounding box
coordinates, and a confidence score. These results can then be manipulated or
printed as needed.
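
The handling of EasyOCR results described above can be sketched as follows. Each detection returned by `readtext` is a tuple of bounding box, text, and confidence. The helper `filter_results`, the function `read_image`, and the 0.5 confidence cut-off are illustrative assumptions, and the sample detections are hand-made rather than produced from a real image.

```python
# Sketch of handling EasyOCR output. Each result is a tuple of
# (bounding_box, text, confidence); filter_results is a hypothetical helper.


def filter_results(results, min_confidence=0.5):
    """Keep the text of detections at or above min_confidence, in order."""
    return [text for _box, text, conf in results if conf >= min_confidence]


def read_image(image_path, languages=("en",)):
    """Run EasyOCR on an image and return the raw detection list."""
    import easyocr  # deferred: the package is large and loads models

    reader = easyocr.Reader(list(languages))
    return reader.readtext(image_path)


if __name__ == "__main__":
    # Hand-made detections in EasyOCR's output shape, used instead of a
    # real image so that the example runs without the library.
    sample = [
        ([[0, 0], [80, 0], [80, 20], [0, 20]], "RUSH", 0.98),
        ([[0, 30], [40, 30], [40, 50], [0, 50]], "~~", 0.12),
    ]
    print(" ".join(filter_results(sample)))
```

Dropping low-confidence detections is a simple way to suppress noise picked up from textured backgrounds, at the cost of occasionally discarding faint but genuine text.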

Besides pytesseract and EasyOCR, other Python libraries with OCR capabilities
are available for extracting text from images. They provide a cohesive
interface to OCR engines for text extraction. Libraries such as PyOCR and
OCRopus provide additional choice and flexibility for OCR in Python. Some
libraries can even be used for both single-page and multi-page document OCR.

5.6 Limitations of Python Libraries

While they work well on basic images, open-source Python libraries
may fall short when complex images come into play. They produce
inaccurate results if the background is pixelated, blurry, or matches the text
color, or if the image is a handwritten or scanned copy. They
perform poorly if the image contains multiple columns or irregular text
placement. Also, they are not equipped with natural language processing
(NLP) features to check and improve output. If the input deviates from the
standard, the Python libraries output incorrect results.
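
Because the libraries do no language-level checking of their output, a lightweight check can be added afterwards. The sketch below is one possible approach under the assumption that a small vocabulary of expected words is available: it replaces each unknown word with its closest vocabulary entry using `difflib.get_close_matches` from the standard library. The function name `correct_words`, the 0.8 cut-off, and the sample vocabulary are illustrative.

```python
from difflib import get_close_matches


def correct_words(text, vocabulary, cutoff=0.8):
    """Replace each word by its closest vocabulary entry, if one is close enough."""
    corrected = []
    for word in text.split():
        if word.lower() in vocabulary:
            corrected.append(word)  # already a known word
            continue
        match = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        # Keep the original word when nothing in the vocabulary is similar.
        corrected.append(match[0] if match else word)
    return " ".join(corrected)


if __name__ == "__main__":
    vocab = ["optical", "character", "recognition"]
    print(correct_words("optica1 charactcr recogniti0n", vocab))
```

This is far simpler than a real NLP pipeline: it ignores context, lower-cases corrected words, and depends entirely on the vocabulary supplied, so it suits narrow domains such as form fields better than free text.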

5.7 Improving Python Libraries' Efficiency

The efficiency of Python libraries can be improved by converting images
before text extraction. First, the image is converted to grayscale or
black and white. The grayscale image can then be converted into a binary
format in which text is shown as black pixels against a white background. To
improve results further, additional code for image preprocessing can be
written. Common preprocessing tasks include applying filters to enhance
clarity, adjusting the contrast between text and background, correcting image
skew or rotation, and normalizing varying text sizes.
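
The grayscale and binary conversion steps described above can be illustrated in pure Python on a handful of pixels. This is a sketch for understanding, not the preprocessing code of this project: it assumes the common BT.601 luminance weights for grayscale conversion and uses Otsu's method, swapped in here as one standard way to pick the black/white threshold automatically. In practice `cv2.cvtColor` and `cv2.threshold` perform the same steps on whole images. All function names are illustrative.

```python
# Illustrative grayscale conversion and Otsu binarisation in pure Python.


def to_gray(rgb_pixels):
    """Convert (R, G, B) tuples to 0-255 luminance values (BT.601 weights)."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]


def otsu_threshold(gray):
    """Return the threshold t that maximises the between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * 256
    for v in gray:
        hist[v] += 1
    total = len(gray)
    total_sum = sum(i * n for i, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]            # number of pixels in the dark class
        sum0 += t * hist[t]
        w1 = total - w0          # number of pixels in the bright class
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (total_sum - sum0) / w1
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(gray):
    """Map dark pixels (text) to 0 and bright pixels (background) to 255."""
    t = otsu_threshold(gray)
    return [0 if v <= t else 255 for v in gray]


if __name__ == "__main__":
    pixels = [10, 12, 11, 200, 210, 205]  # three ink pixels, three paper pixels
    print(binarize(pixels))
```

An automatically chosen threshold adapts to the overall brightness of each scan, which a fixed cut-off cannot do. Unevenly lit photographs may still need a locally adaptive threshold.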

In essence, converting images to text considerably enhances the
accessibility and productivity of any data-heavy business operation.
Using Python libraries to streamline this process further improves the
overall efficiency and accuracy of text extraction. With continued progress
in AI and OCR technologies, the process is likely to become more streamlined
and refined in the future.

Chapter 6 Conclusions and Recommendations
6.1 Recommendations:

Looking ahead, the future of OCR is poised for further innovation. The
integration of OCR with technologies like Natural Language Processing
(NLP) and Machine Learning (ML) will enhance its accuracy and efficiency.
Additionally, the development of OCR for less commonly used scripts and
languages will open new avenues for global information exchange. It is also
recommended that future research focuses on improving OCR's capability to
interpret handwritten texts and complex layouts, making it more versatile and
user-friendly.

Investing in the continuous improvement of OCR technology will
undoubtedly yield significant benefits across numerous sectors, including
education, healthcare, finance, and legal industries, where data extraction and
analysis are crucial.

6.2 Conclusions

Optical Character Recognition has been around for many years and has
become increasingly important as the amount of digital information has
grown. The future of OCR development using Python looks very promising,
as Python is a popular and widely used programming language for various
applications, including OCR. Here are a few areas where OCR development
using Python is expected to grow in the future:

- Improved Accuracy: With advancements in deep learning and computer vision,
OCR algorithms will continue to improve their accuracy in recognizing text in
images and PDFs, leading to even better performance.

- Real-Time OCR: As the demand for real-time processing of images and videos
increases, OCR systems will need to adapt to real-time processing. Python's
ease of programming and its ability to handle real-time data processing
through optimized libraries make it a suitable choice for developing
real-time OCR systems.

- Multilingual OCR: As the world becomes more connected and globalized, the
demand for OCR systems that can handle multiple languages will continue to
grow. Python has strong support for processing multiple languages, which
makes it a good platform for multilingual OCR development.

- Handwriting Recognition: With the increasing use of digital devices for
notetaking, the demand for OCR systems that can recognize handwritten text
will continue to grow. Python's ability to integrate with machine learning
libraries like TensorFlow and PyTorch makes it a great choice for developing
handwriting recognition systems.

- Integration with Other Technologies: OCR technology will continue to be
integrated with other technologies like augmented reality, virtual reality,
and the Internet of Things (IoT) to create new and innovative applications.
Python's ability to integrate with various technologies and its popularity as
a programming language make it a good choice for developing these kinds of
applications.

OCR (Optical Character Recognition) technology is a fast-developing field,
and numerous ongoing research projects are working to increase the accuracy,
speed, and adaptability of OCR. Here are some of the most recent OCR research
trends:

- Deep learning-based OCR: Systems that can accurately recognize characters
and words are being developed using deep learning algorithms like CNNs and
RNNs. To enhance OCR performance, researchers are experimenting with new deep
learning architectures, training methodologies, and data augmentation
strategies.

- Multimodal OCR: To increase OCR accuracy and make input methods more
adaptable, multimodal OCR systems combine image recognition with speech
recognition or natural language processing. To improve OCR performance,
researchers are investigating new multimodal OCR designs, such as
attention-based models.

- Multilingual OCR: OCR systems that can recognize and translate text from a
variety of languages are becoming more and more crucial in today's globalized
society. Using methods like language modelling, cross-lingual transfer
learning, and neural machine translation, researchers are creating
multilingual OCR systems.

- OCR for low-resource languages: OCR systems for low-resource languages
encounter a number of difficulties, including a lack of standardization and a
lack of training data. Researchers are investigating techniques, such as
transfer learning and unsupervised learning, to adapt current OCR systems to
low-resource languages.

- OCR for historical documents: OCR systems for historical documents confront
a number of difficulties, including deterioration, noise, and differences in
writing styles. The accuracy of OCR on historical documents is being improved
by researchers using techniques like image enhancement, context-based
character recognition, and crowdsourcing.

In general, ongoing OCR research strives to increase OCR speed, accuracy, and
adaptability as well as to make OCR available for a wider variety of
applications and languages. OCR technology is expected to become more crucial
as tasks related to digitalization, automation, and data analysis progress.

In this work, we have developed an OCR application using Python. We used
well-known libraries for extracting text data from images, documents,
website URLs, and so on. The Python libraries we used include Apache Tika,
requests, warnings, pytesseract, PIL, os, io, pypdf, pdfplumber, flask,
open-cv, pymupdf, scikit-learn, scipy, matplotlib, youtube-dl, and shutil.
In this work, we introduced Python as a practical language for teaching and
practical programming. We also examined Python's characteristics, features,
and the kinds of programming support it offers. In line with these qualities,
we found Python to be a fast, powerful, versatile, simple, open-source
language that keeps up with numerous advancements. Various Python projects of
different types were then presented. The report has likewise examined how
Python is being used by various organizations. Based on facts gathered from
well-known and reliable journals and sites, the work has discussed the
reasons why Python is the fastest-growing programming language.

References

[1] J. L. Lions, "ARIANE 5 Flight 501 Failure Report," 2010.
[2] The Tesseract open source OCR engine, http://code.google.com/p/tesseract-ocr.
[3] Lisa F. Rau, Paul S. Jacobs, Uri Zernik, "Information Extraction and Text
Summarisation Using Linguistic Knowledge Acquisition," Information
Processing and Management, Volume 25, Issue 4, pp. 419-428.
[4] S. B. Kotsiantis, D. Kanellopoulos and P. E. Pintelas, "Data Preprocessing
for Supervised Learning," International Journal of Computer Science, Volume
1, Number 1, 2006, ISSN.
[5] Jonathan Webster, Chunyu Kit, "Tokenization as the Initial Phase in NLP,"
City Polytechnic of Hong Kong, in Proceedings of the 14th Conference on
Computational Linguistics, Vol. 4, pp. 1106-1110.
[6] A. Mitthal, P. Kumarguru, "Optical Character Recognition Tool," IIIT D.
Dr. S. Vijayarani, Ms. J. Ilamathi, Ms. Nithya, "Preprocessing Techniques for
Text Mining - An Overview," International Journal of Computer Science &
Communication Networks, Vol. 5(1), pp. 7-16.
[7] Meyer, David, Hornik, Kurt, and Feinerer, Ingo (2008), "Text Mining
Infrastructure in R," Journal of Statistical Software, 25(5), pp. 1-54.
[8] Steven Bird, Edward Loper, "NLTK: Natural Language Toolkit," in
Proceedings of the ACL 2004 on Interactive Poster and Demonstration
Sessions, Article no. 31.
[9] R. Smith, "An Overview of the Tesseract OCR Engine," Proc. 9th Int.
Conf. on Document Analysis and Recognition, IEEE, Curitiba, Brazil, Sep
2007, pp. 629-633.
[10] The Tesseract open source OCR engine, http://code.google.com/p/tesseract-ocr.
[11] R. W. Smith, "The Extraction and Recognition of Text from Multimedia
Document Images," PhD Thesis, University of Bristol, November 1987.
[12] WingSoon Wilson Lian, "Heuristic-Based OCR Post-Correction for Smart
Phone Applications," Honors Thesis, Department of Computer Science,
University of North Carolina at Chapel Hill, 2009.
[13] Sonia Bhaskar, Nicholas Lavassar, Scott Green, "Implementing Optical
Character Recognition on the Android Operating System for Business Cards,"
EE 368 Digital Image Processing.
[14] Papers With Code. A comprehensive overview of OCR technologies,
benchmarks, and datasets.
[15] Hegghammer. Benchmarking experiment comparing the performance of
Tesseract, Amazon Textract, and Google Document AI.
[16] Springer. Detailed review on text extraction using OCR.
[17] Academia.edu. Extensive overview of recent OCR research.
