Text Recognition in Images and Converting Recognized Text to Speech: Image Processing
Adwit Singh
Department of Computer Science and
Engineering,
University of Spain
[email protected]
Abstract: This review article provides an overview of recent advancements in image-based text recognition and in converting recognized text to speech using image processing techniques. The article covers various techniques for text recognition in images, including Optical Character Recognition (OCR), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). Additionally, the process of converting recognized text to speech is discussed, including several Text-to-Speech (TTS) techniques such as concatenative TTS, formant synthesis, and parametric synthesis.

Keywords: Text recognition, image processing, speech synthesis, machine learning, deep learning, convolutional neural networks, natural language processing.

I. INTRODUCTION

Text recognition in images and the conversion of recognized text to speech are important areas of study in image processing. This paper reviews the most recent studies on both tasks. It covers the various methods for text recognition and voice synthesis, including machine learning, deep learning, convolutional neural networks, and natural language processing. The goal is to summarize the current state-of-the-art methods for voice synthesis and text recognition while also identifying open questions that need further investigation.

Efficient methods for extracting information from the vast amount of text data available in a variety of formats, including images, are essential. The field has developed significantly in recent years with the advent of machine learning, deep learning, and other cutting-edge techniques. The objective of this literature survey is to provide a comprehensive overview of the most recent research on text recognition in images and on converting recognized text into speech, highlighting the most significant advancements and identifying open research questions. The importance of text recognition in photographs is briefly discussed in the introduction. A discussion of different text recognition techniques follows, including standard Optical Character Recognition (OCR) techniques as well as machine learning and deep learning techniques. In addition, the article covers the advantages and disadvantages of using convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for text recognition in images. It then reviews well-known text-to-speech conversion techniques, including concatenative synthesis, formant synthesis, and statistical parametric synthesis, and discusses the advantages and disadvantages of each method.

Before summarizing the most recent approaches for text identification in images and speech generation, the study also highlights the research gaps that need to be filled. In particular, it points to the need for more research into recognition in low-resolution images and into natural-sounding speech. There is a lot of potential in this area, and more research will be required to develop methods for speech synthesis and text recognition that are more precise and efficient.
Text-based image identification is also known as optical character recognition (OCR). With this technology, a computer can recognize and extract text from images or scanned documents. OCR digitizes printed or handwritten text and turns it into a format that other software applications can use to find, edit, and display the content.

OCR involves a number of steps. The image is first preprocessed to highlight the text and remove background noise. The text is then segmented and detected using pattern recognition algorithms. The final output is generated once the detected text has been validated and corrected. [19]

Fig. 1. OCR

Once the text has been identified, text-to-speech (TTS) technology can convert it into speech. TTS is a technology that transforms written text into spoken audio.

Convolutional Neural Networks (CNNs) have achieved outstanding results in a variety of computer vision applications, such as semantic segmentation, object detection, and image classification [17]. CNNs have also been used in recent years to recognize text in photos. Traditional CNN models, however, need a lot of memory and processing power, which makes them unsuitable for devices with limited resources such as smartphones and embedded systems. [16]

The novelty lies in creating a system that can recognize handwritten or cursive text and transform it to speech in the same language, as well as a system that can recognize text in various languages. To address the memory and processing demands of traditional CNNs, researchers have proposed a method called CNN with Tensor Train decomposition (CNN TT). This method represents the CNN model more compactly, resulting in a smaller memory footprint and lower computational complexity without sacrificing speed. [15]

Fig. 2. Process of OCR

This review provides an overview of the most current techniques for text detection in photos using CNNs and their variations. We focus on the CNN technique and its many applications, including document analysis, analysis of handwritten text, and text recognition in scene photos. We discuss the field's challenges and likely future directions while highlighting the possible impact of CNNs on real-world applications. Text identification in photographs requires finding and extracting text from the images. This can be a difficult task because of differences in typeface, size, orientation, lighting, and other factors that affect image quality. Text recognition experiments have demonstrated the effectiveness of CNNs [10-13], particularly when combined with additional methods such as OCR. Because CNNs learn features from the images automatically, they can increase the precision of text recognition. [14]

II. LITERATURE REVIEW

A. "Text Recognition in Pictures" using deep learning techniques, by F. A. Rodriguez-Saona et al., was published in 2018. This work uses deep learning methods to find text in photographs. To increase the precision of text recognition, the authors propose a new model that combines long short-term memory (LSTM) networks and convolutional neural networks (CNNs). After being trained on a large dataset of photos, the proposed model achieves state-of-the-art performance on a number of benchmarks.

B. "A Study of Text Detection and Recognition in Pictures" by S. R. Singh and S. Ghosh was published in 2019. This survey summarizes the many methods for text detection and recognition in photographs. The authors discuss the difficulties in text recognition and give a thorough breakdown of current developments. They also compare the results of various algorithms on common benchmarks and propose areas for further research.

III. METHODOLOGY

A. Working Principle

An image-to-text-to-speech converter takes a picture as input and creates an audio file that reads aloud the text found in the image. Such a converter works in the following stages: [18]

1) Image recognition:
The process starts by using image recognition technology to identify the text present in the picture. This is achieved with techniques such as Optical Character Recognition (OCR), which extracts the text from the image and converts it into a machine-readable format.

2) Text processing:
After being extracted from the picture, the text is cleaned up to remove extraneous characters, punctuation, or formatting. This step helps ensure that the final audio file is easy to listen to and accurately conveys the meaning of the original image.

3) Text-to-speech synthesis:
The cleaned text is then converted into an audio file using text-to-speech synthesis technology. Typically, a clear and understandable computer-generated voice reads the material aloud.

4) Output:
2023 10th IEEE Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON)
The user then plays back the completed audio clip to hear the text read aloud. This may be particularly useful for people who have trouble reading or who have vision problems, because it enables them to access and understand the content of the image without relying on their vision. By combining image recognition, text processing, and text-to-speech synthesis, an image-to-text-to-speech converter turns photos into audio files accurately and quickly.

Fig. 3. Working of the model

B. Algorithms

1) Image Processing:
Image processing can be used to improve the precision and efficiency of an image-to-text-to-speech converter. The following are some examples of its uses in this context:

a) Image enhancement:
Before OCR, the image can be enhanced to improve contrast, sharpness, and general visual quality. This reduces errors in the text output and lets OCR systems function more effectively.

b) Noise reduction:
Scratches and other types of image noise can make OCR less accurate. Image processing methods such as filtering and denoising can lessen this noise and improve OCR performance.

c) Text detection:
Image processing can be used to identify and locate text in a photograph. This helps OCR algorithms by letting them focus only on the text and disregard non-text areas such as graphics or pictures.

d) Segmentation:
Image processing can be used to separate the text into individual characters or words, improving OCR precision by reducing the likelihood that one character will be mistaken for another.

2) Optical Character Recognition (OCR):
Optical character recognition (OCR) lets computers read printed or handwritten text from images such as scanned documents, photos, or screenshots. The OCR algorithm typically includes several stages:

a) Pre-processing:
The image is enhanced and cleaned up to increase OCR accuracy. This might involve adjusting the contrast, aligning the text, or removing noise.

b) Binarization:
The image is converted to black and white so that the OCR system is better able to distinguish the text from the background.

c) Segmentation:
The OCR algorithm divides the image into regions that correspond to individual characters or words.

d) Recognition:
The segmented words or characters are compared to a database of known words or characters, and the most probable match is chosen.

e) Post-processing:
Any errors or inconsistencies in the identified text are fixed to boost accuracy. OCR algorithms differ in speed and quality, and some work better with particular types of text or languages than others. Thanks to advancements in machine learning and artificial intelligence, OCR accuracy has improved, and it is now more robust to variations in font, size, and image quality.

3) Text Processing:
An image-to-text-to-speech converter is not complete without text processing, which converts the detected text into a format that a text-to-speech engine can read aloud. Text processing serves several purposes in this context:

a) Text normalization:
The identified text may contain a variety of grammatical or spelling irregularities, such as typos, abbreviations, or incorrect capitalization. Text normalization standardizes the text and makes it simpler to comprehend.

b) Text segmentation:
The identified text may already be broken up into separate words or sentences, or it may simply be a continuous string of characters. Text segmentation breaks the text into meaningful units, such as words or sentences.

c) Text markup:
Text processing can also involve adding markup or tags to the text, which the text-to-speech engine can use to control how the material is spoken. For example, markup can specify how certain words or phrases should be pronounced or show where pauses or accents should be placed.

d) Language and voice selection:
Text processing may also entail selecting the appropriate language and voice for the text-to-speech output, depending on the user's preferences and the language of the recognized text.

4) Text-to-Speech (TTS):
The text-to-speech (TTS) function of an image-to-text-to-speech converter turns recognized text into spoken words that the user can hear. TTS is used in this context in the following ways:

a) Speech synthesis:
TTS employs a speech synthesis engine to create human-like speech from recognized text. The engine may generate speech using a pre-recorded voice or a text-to-speech synthesis algorithm.

b) Voice selection:
TTS gives users the option to pick the voice and language based on their preferences or the language of the recognized text. [20]
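As an illustration of the text-processing steps described in Section III-B (normalization, segmentation, markup, and language and voice selection), the following Python sketch cleans OCR output, splits it into sentences, and wraps it in SSML, the W3C Speech Synthesis Markup Language. It is a minimal sketch under stated assumptions, not the method of any surveyed system: the abbreviation table, the function names, the default pause length, and the voice name are hypothetical, and whether a particular TTS engine accepts SSML or honors the `voice` element depends on that engine.

```python
from __future__ import annotations

import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical abbreviation table; a real system would need a larger lexicon.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}


def normalize(text: str) -> str:
    """Clean raw OCR text: rejoin hyphenated line breaks, collapse
    whitespace, and expand the abbreviations listed above."""
    # Rejoin words split across lines ("Spring-\nfield" -> "Springfield").
    # Known limitation: this also joins genuine hyphenated compounds.
    text = re.sub(r"-\s*\n\s*", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    words = [ABBREVIATIONS.get(w.lower(), w) for w in text.split(" ")]
    return " ".join(words)


def split_sentences(text: str) -> list[str]:
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def to_ssml(sentences: list[str], lang: str = "en-US",
            voice: str | None = None, pause_ms: int = 400) -> str:
    """Wrap sentences in SSML with a pause between them.

    `lang` sets xml:lang on the root; `voice` (an assumed, engine-specific
    name) is emitted as a <voice> element only when given.
    """
    pause = f'<break time="{pause_ms}ms"/>'
    body = pause.join(f"<s>{escape(s)}</s>" for s in sentences)
    if voice:
        body = f"<voice name={quoteattr(voice)}>{body}</voice>"
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>{body}</speak>"
    )
```

A typical call chain would be `to_ssml(split_sentences(normalize(ocr_text)), lang="en-US")`. Abbreviations are expanded before segmentation so that a period in "Dr." is not mistaken for the end of a sentence.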
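The binarization stage of OCR described in Section III-B can be sketched with Otsu's method, a standard way to choose a global threshold automatically. Otsu's method is not named in the text; it is used here only as one common choice. The sketch assumes the image is a nested list of 8-bit grayscale values with dark text on a light background, and the function names are illustrative.

```python
from __future__ import annotations


def otsu_threshold(pixels: list[list[int]]) -> int:
    """Return the 8-bit threshold that maximizes the between-class
    variance of the two pixel groups it creates (Otsu's method)."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels with value <= t
    sum_dark = 0  # sum of the intensities of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no pixels in the dark class yet
        w_light = total - w_dark
        if w_light == 0:
            break     # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels: list[list[int]], threshold: int | None = None) -> list[list[int]]:
    """Map pixels at or below the threshold to 0 (text) and the rest to 255."""
    t = otsu_threshold(pixels) if threshold is None else threshold
    return [[0 if value <= t else 255 for value in row] for row in pixels]
```

A single global threshold tends to work when lighting is even; for photographs with shadows or uneven illumination, a locally adaptive threshold is usually a better fit.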
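The recognition and post-processing stages of OCR described in Section III-B compare recognized words against known words and pick the most probable match. The sketch below illustrates that idea with fuzzy string matching from Python's standard `difflib` module. It is a simplified illustration, not how any particular OCR engine works: the vocabulary, the similarity cutoff of 0.8, and the function names are assumptions chosen for the example.

```python
from __future__ import annotations

import difflib
import re


def correct_word(word: str, vocabulary: list[str], cutoff: float = 0.8) -> str:
    """Return the closest vocabulary entry for `word`, or `word` itself
    when it is already known or nothing is similar enough."""
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    # get_close_matches ranks candidates by SequenceMatcher similarity
    # and drops any that score below `cutoff`.
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text: str, vocabulary: list[str], cutoff: float = 0.8) -> str:
    """Apply correct_word to every alphanumeric token, leaving
    punctuation and spacing untouched."""
    return re.sub(
        r"[A-Za-z0-9]+",
        lambda m: correct_word(m.group(), vocabulary, cutoff),
        text,
    )


if __name__ == "__main__":
    vocab = ["recognition", "text", "image", "speech"]
    print(correct_text("Text recogniti0n and speeoh.", vocab))
```

The cutoff trades missed corrections against false ones: a lower value repairs more OCR errors but is more likely to replace a correct word that happens to be absent from the vocabulary.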
c) Speech customization:
TTS can also tailor the speech output to the user's requirements. For instance, the tempo, loudness, and pitch of the voice can be changed to enhance its naturalness and intelligibility.

d) Audio output:
TTS generates audio that can be heard through speakers or headphones. The audio output can be recorded or streamed in real time, depending on the user's needs.

By integrating TTS into an image-to-text-to-speech converter, the identified text can be converted into spoken words that the user can hear. This can make the converter more accessible and usable for people with vision problems who might have difficulty reading text on a computer.

IV. RESULTS

Recent works on text recognition in images and on converting recognized text to speech have shown promising results. Deep learning-based methods such as CNNs and RNNs have outperformed traditional methods. Concatenative TTS techniques have been found to produce more natural-sounding speech than parametric TTS techniques.

Text recognition in images and text-to-speech conversion are difficult image processing problems with many real-world uses. OCR, deep learning-based methods, rule-based methods, and other approaches have all been suggested as solutions. Despite the encouraging findings of recent studies, further work is required to increase the precision and effectiveness of these methods. OCR is a powerful tool in image processing that enables the automated extraction of text from images, opening up a wide range of applications in various fields.

V. CONCLUSION

Text recognition in images and text-to-speech conversion are two important fields of study in image processing and essential tools in the contemporary world. The accuracy of text recognition has improved considerably as a result of recent advancements in deep learning techniques, and excellent text-to-speech synthesis has been achieved using neural networks. These fields still require research, particularly into algorithms that can handle complex visuals and generate speech that sounds natural. Numerous methods have been developed for text recognition in images, and deep learning-based methods have shown encouraging results. As TTS synthesis techniques have developed over time, concatenative TTS techniques have been found to create more natural-sounding speech than parametric TTS techniques. Future work can concentrate on improving the accuracy of text detection in images and developing more efficient TTS synthesis techniques.

REFERENCES

[1] S. Arik et al., "Deep voice: Real-time neural text-to-speech," 34th International Conference on Machine Learning, 2017.
[2] Y. Wang et al., "Tacotron: Toward end-to-end speech synthesis," Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2017.
[3] A. Stepikhov, Text, Speech, and Conversation, 2013.
[4] "Voice to Text Conversion Using Android Platform," International Journal of Engineering Research Applications, 2013.
[5] B. V. P. and P. Khilari, "A Review on Voice To Text Conversion Techniques," International Journal of Advanced Research in Computer Engineering Technology, 2015.
[6] A. Graves and N. Jaitly, "Towards end-to-end voice recognition with recurrent neural networks," in ICML 2014, the 31st International Conference on Machine Learning.
[7] C. Herff et al., "Brain-to-text: Decoding spoken sentences from phone representations in the brain," Front. Neurosci., 2015.
[8] J. Trmal, L. Ondel, S. Kesiraju, and L. Burget, "Bayesian joint-sequence models for grapheme-to-phoneme conversion," ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2017.
[9] S. O. Arik et al., "Deep voice 2: Multi-speaker neural text-to-speech," Advances in Neural Information Processing Systems, 2017.
[10] M. M. War, M. Rakhra and D. Singh, "Review On Application Based Bus Tracking System," 2022 5th International Conference on Contemporary Computing and Informatics (IC3I), Uttar Pradesh, India, 2022, pp. 876-880, doi: 10.1109/IC3I56241.2022.10072449.
[11] S. M. Makhdoomi, M. Rakhra, D. Singh and A. Singh, "Artificial-Intelligence based Prediction of Post-Traumatic Stress Disorder (PTSD) using EEG reports," 2022 5th International Conference on Contemporary Computing and Informatics (IC3I), Uttar Pradesh, India, 2022, pp. 1073-1077, doi: 10.1109/IC3I56241.2022.10072671.
[12] R. S. Kushwaha, M. Rakhra, D. Singh and A. Singh, "An Overview: Super-Image Resolution using Generative Adversarial Network for Image Enhancement," 2022 5th International Conference on Contemporary Computing and Informatics (IC3I), Uttar Pradesh, India, 2022, pp. 1243-1246, doi: 10.1109/IC3I56241.2022.10072862.
[13] T. Soewu, S. V. Uday Kalyan, M. Rakhra and D. Singh, "Lung Cancer Detection using Image Processing," 2022 5th International Conference on Contemporary Computing and Informatics (IC3I), Uttar Pradesh, India, 2022, pp. 1206-1211, doi: 10.1109/IC3I56241.2022.10072589.
[14] C. Harika, D. Singh, A. Singh and M. Rakhra, "IoT Solution for Automatic Watering System," 2022 5th International Conference on Contemporary Computing and Informatics (IC3I), Uttar Pradesh, India, 2022, pp. 1068-1072, doi: 10.1109/IC3I56241.2022.10073082.
[15] T. Soewu, Hemant, M. Rakhra and D. Singh, "Analysis of Data Mining-Based Approach for Intrusion Detection System," 2022 5th International Conference on Contemporary Computing and Informatics (IC3I), Uttar Pradesh, India, 2022, pp. 908-912, doi: 10.1109/IC3I56241.2022.10072828.
[16] A. Ansari, B. Kaur, M. Rakhra, A. Singh and D. Singh, "Handwritten Text Recognition using Deep Learning Algorithms," 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), Delhi, India, 2022, pp. 1-6, doi: 10.1109/AIST55798.2022.10065348.
[17] R. Kumar Shukla, M. Rakhra, D. Singh and A. Singh, "The Role of Machine Learning in Health Care Diagnosis," 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), Delhi, India, 2022, pp. 1-6, doi: 10.1109/AIST55798.2022.10064906.
[18] A. Singh and M. Rakhra, "A Review For Different Sign Language Recognition Systems," 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), Delhi, India, 2022, pp. 1-6, doi: 10.1109/AIST55798.2022.10065037.
[19] M. K. Dath, M. Rakhra, D. Singh, A. Singh and R. Banala, "Basic design for the implementation of automatic surveillance system on helmet detection," 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), Delhi, India, 2022, pp. 1-5, doi: 10.1109/AIST55798.2022.10065367.
[20] A. Sharma and D. Singh, "A Statistical Review on Machine Learning Based Medical Diagnostic Systems for Chronic Kidney Disease," 2022 3rd International Conference on Computation, Automation and Knowledge Management (ICCAKM), Dubai, United Arab Emirates, 2022, pp. 1-5, doi: 10.1109/ICCAKM54721.2022.9990508.