Zeichen Journal, ISSN No. 0932-4747, Volume 11, Issue 02, February 2025

DOCUMENT TO VOICE CONVERTER FOR BLIND


Dr. Meril Cyriac, Aani Shaji, Amritha MM, Avani Rajeev, Thara Thilak

Assistant Professor, Department of Electronics and Communication Engineering,
LBS Institute of Technology for Women, Thiruvananthapuram

Abstract

This paper develops an enhanced document-to-voice conversion tool that addresses key limitations of current accessibility technologies. Traditional text-to-speech solutions often lack advanced features such as summarization and translation, making it difficult for users, especially those who are visually impaired or face language barriers, to process information efficiently and to access content in multiple languages. Our tool integrates these essential functionalities: text extraction, intelligent summarization, multi-language translation, and high-quality, natural-sounding speech output. Designed with accessibility and affordability in mind, it allows users to convert lengthy or complex documents into concise audio summaries, significantly reducing the cognitive load and time required to understand content. The translation feature ensures that users can seamlessly access multilingual materials, greatly expanding the scope of accessible information. These advanced features enhance the user experience, making the tool beneficial for diverse users, including students, professionals, and organizations with budget constraints. By combining affordability, accessibility, and usability, the tool empowers a wide range of individuals to interact with written content more effectively and independently. It provides an inclusive, user-centered platform that bridges accessibility gaps and supports equal access to information, creating a more supportive and equitable information landscape. This solution represents a critical step forward in accessibility technology, offering a meaningful, practical tool that allows users to engage with information regardless of visual ability or language proficiency.

1. Introduction

Document-to-voice converters are essential for visually challenged individuals, who otherwise depend on others to access books, articles, and documents. These converters make written content accessible, allowing users to read books, articles, and documents independently.


They also promote equal learning opportunities in educational settings. Users can process information quickly and engage with content dynamically, for example by listening to articles or storybooks while multitasking. Integrating features such as summarization and translation further enhances this functionality by improving efficiency and accessibility, saving time and making information more digestible. Translation also helps non-native speakers understand the content in their own language [1].

2. Problem Statement

Existing document-to-voice converters often lack additional features such as summarization and translation, which makes them less effective and leads to difficulties in processing information. There is therefore a need for a better tool that combines text-to-speech with these features. Such a solution would help visually challenged individuals access articles and books, and would also help non-native speakers understand the content in their own language, providing a more inclusive environment for all users regardless of their visual abilities or language skills.

The goal of this paper is to integrate summarization and translation into a document-to-voice converter so that visually impaired individuals can quickly grasp key information in their preferred language. This approach enhances accessibility and engagement by offering a more efficient listening experience, while compatibility with educational and professional platforms supports wider usability and adoption across diverse contexts. Summarizing long documents into meaningful short paragraphs allows users to understand the key ideas and saves time, and translation allows non-native speakers to understand the content in their own language. The system also supports multitasking by allowing users to listen to content while engaging in other activities. It is particularly helpful for students with visual impairments: by providing tools that promote equal opportunities in educational settings, it helps visually challenged students access course materials more effectively. Moreover, the system enhances both efficiency and accessibility.


4.1 Specific Objectives

The proposed system revolves around integrating several advanced functionalities to develop a
robust and efficient document-to-speech solution. First, the system seeks to implement highly
accurate Optical Character Recognition (OCR) using tools such as docTR or Tesseract-OCR.
These tools are capable of extracting text from a wide range of document types, including
printed and handwritten materials, ensuring the system is adaptable to various text formats and
maintains a high level of accuracy. This feature enables users to digitize content from physical
documents seamlessly. Next, the system incorporates summarization capabilities using AI-
driven models like those provided by Hugging Face Transformers. This feature is designed to
process lengthy documents and condense them into concise, meaningful summaries. By focusing
on the core information, this functionality significantly reduces the time users spend on content
consumption while ensuring they retain the most critical insights.
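
As a minimal illustration of this OCR stage, the sketch below uses pytesseract, the Python wrapper for the Tesseract-OCR engine named above; the image file name is a placeholder and the Tesseract binary is assumed to be installed and on the system PATH.

from PIL import Image
import pytesseract

def extract_text_tesseract(image_path: str) -> str:
    """Extract text from a scanned document image with Tesseract-OCR."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image)

# "scanned_page.png" is an illustrative file name, not from the paper.
print(extract_text_tesseract("scanned_page.png"))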

The system also integrates translation capabilities to support a multilingual user base. Using
libraries like Google Translate, the system ensures that users, particularly non-native speakers,
can access and understand content in their preferred language. This feature enhances
accessibility by breaking down language barriers and expanding the system’s usability across
different linguistic groups. Finally, the system includes a Text-to-Speech (TTS) module to
convert extracted or summarized text into speech. It utilizes tools such as pyttsx3 for offline use
and gTTS for natural-sounding audio when internet access is available. This functionality
ensures that visually impaired individuals can listen to content effortlessly, making the system
user-friendly and inclusive. These specific objectives aim to create a comprehensive solution that
seamlessly integrates OCR, summarization, translation, and TTS functionalities. The system is
optimized to provide high performance, accuracy, and accessibility, catering to diverse user
needs and enhancing their interaction with written content.

4.2 Broad Objectives

The proposed system aims to address critical challenges in accessibility, learning, efficiency,
inclusivity, and versatility by leveraging advanced technologies. One of its primary objectives is
to promote accessibility by bridging the gap between written and auditory communication,
making content more accessible to visually impaired individuals [2]. Furthermore, it enables
non-native speakers to understand written materials in their preferred language through
translation, breaking down language barriers and enhancing comprehension.


In terms of learning opportunities, the system seeks to provide equal access to educational
resources for visually impaired students by converting course content into audio format,
ensuring they can keep pace with their peers. It also supports non-native speakers in academic
and professional settings by offering language adaptability, which enhances their ability to
engage with course materials, research papers, and workplace documents without linguistic
limitations. The system is designed to improve user efficiency by summarizing lengthy
documents into concise and meaningful content, saving valuable time for users. It also enables
multitasking by providing an audio-based solution, allowing individuals to listen to content
while performing other activities, thus enhancing productivity in various scenarios. This
approach empowers users to explore books, articles, and other documents without relying on
external assistance, contributing to their independence and confidence. Finally, the system is
built with versatility in mind, ensuring compatibility across diverse platforms, including
educational institutions and professional environments. It is designed to be adaptable, allowing
for future enhancements, such as the integration of new features or deployment on embedded
devices. This flexibility ensures the system remains relevant and capable of meeting evolving
user needs in a dynamic technological landscape. By addressing these objectives, the system
aims to create a more inclusive and efficient way of accessing and interacting with written
content.

5. Proposed Methodology

The proposed methodology offers a comprehensive approach to building a document-to-voice conversion system. By integrating OCR, text cleaning, summarization, translation, and text-to-speech technologies, the system effectively transforms written text into spoken language. The modular design allows for flexibility and adaptability to different document formats and user preferences. The system's ability to process documents, extract relevant information, and present it in an auditory format makes it a valuable tool for users with visual impairments or those who prefer auditory content consumption.

5.1 Document Scanning and Text Extraction

The system begins by digitizing the physical document. This is achieved through either scanning
the document using a scanner or capturing an image of it using a camera connected to the
computer. The acquired image is then fed into an Optical Character Recognition (OCR) engine.
This engine employs advanced algorithms to analyze the image pixel by pixel, identifying and
recognizing individual characters.


Once the characters are recognized, they are converted into digital text, effectively transforming the scanned image into a machine-readable format. This extracted text can then be further processed for tasks like summarization, translation, or text-to-speech conversion [3][4].
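
A minimal sketch of this extraction stage with docTR, the deep learning OCR library adopted in Section 6, might look as follows; the pretrained model choice and the file name are illustrative assumptions, not the paper's exact configuration.

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

def extract_text(image_path: str) -> str:
    """Run docTR's text detection and recognition on a document image."""
    model = ocr_predictor(pretrained=True)        # downloads pretrained weights
    document = DocumentFile.from_images(image_path)
    result = model(document)
    return result.render()                        # plain-text reconstruction

# "scanned_page.png" stands in for the scanner or camera capture.
text = extract_text("scanned_page.png")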

5.2 Text-to-Speech (TTS) Conversion

The extracted text is fed into a Text-to-Speech (TTS) engine, which transforms it into natural-
sounding spoken language. This engine employs sophisticated algorithms to analyze the text,
identify the appropriate pronunciation of words, and generate corresponding audio waveforms.
The user can control the initiation of the reading process by pressing a physical switch. This
switch sends a signal to the PC, triggering the TTS engine to start processing the text and
generating the audio output. The synthesized speech can be played through the system's
speakers or headphones, providing an auditory representation of the written content.
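
A minimal sketch of this stage with pyttsx3, the offline engine listed in Section 6, follows; the speaking rate is an illustrative setting, and in the full system the call would be triggered by the switch signal rather than invoked directly.

import pyttsx3

def read_aloud(text: str) -> None:
    """Convert text to speech and play it through the system speakers."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)   # speaking rate in words per minute
    engine.say(text)
    engine.runAndWait()               # blocks until playback finishes

read_aloud("The document has been converted to speech.")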

5.3 Language Translation

The extracted text, once cleaned and processed, can be translated into a desired language using
a language translation API. These APIs, such as Google Translate or DeepL, leverage advanced
machine learning techniques to accurately translate text from one language to another. By
integrating such an API into the system, users can access information in their preferred
language. To initiate the translation process, a dedicated switch can be incorporated. When this
switch is pressed, the system will trigger the translation API to translate the text. The translated
text is then fed into the TTS engine [8].
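
One way to realize this stage in Python is the unofficial googletrans package, which wraps the Google Translate web API; the paper does not name a specific client, and the synchronous interface shown here matches googletrans 4.0.0rc1 (newer releases expose an asynchronous API instead).

from googletrans import Translator

def translate_text(text: str, dest_lang: str = "hi") -> str:
    """Translate extracted text into the user's preferred language."""
    translator = Translator()
    return translator.translate(text, dest=dest_lang).text

# 'hi' (Hindi) is an illustrative target language code.
print(translate_text("The scan completed successfully.", "hi"))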

5.4 Summarization of Text

To further enhance the system's functionality, a summarization model can be integrated. This
model, such as those provided by Hugging Face's Transformers or LangChain, can process the
extracted text and generate a concise summary. This summary captures the key points of the
document, making it easier for users to quickly grasp the main ideas. The user can trigger the
summarization process by pressing a dedicated switch. Upon receiving this signal, the system
will feed the extracted text into the summarization model. The model will then process the text
and generate a summary. This summary can be either displayed on the screen or spoken aloud
using the TTS engine. By incorporating a summarization model, the system can provide a more
efficient and user-friendly experience, especially when dealing with lengthy documents [5][6][7].
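
A minimal sketch of this stage with the Hugging Face Transformers pipeline is shown below; the model choice facebook/bart-large-cnn and the length limits are illustrative assumptions, not the paper's exact configuration.

from transformers import pipeline

# Model name and length limits are illustrative, not the paper's settings.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize(text: str) -> str:
    """Condense extracted document text into a short summary."""
    result = summarizer(text, max_length=120, min_length=30, do_sample=False)
    return result[0]["summary_text"]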


5.5 Audio Output

The final stage of the process involves the audio output. Once the text-to-speech engine has
converted the processed text into audio waveforms, the system plays the generated audio
through the device's speakers or headphones. This auditory output provides a convenient and
accessible way for users to consume the information. The quality of the audio output is
influenced by factors such as the TTS engine's capabilities, the quality of the input text, and the
system's hardware. By optimizing these factors, a clear and natural-sounding audio experience
can be achieved.
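
For the online path, a sketch of the audio-output stage with gTTS (named in Section 6 for more natural voices when internet access is available) could look like this; playback through the third-party playsound package is an assumption, and any audio player would serve.

from gtts import gTTS
from playsound import playsound   # playback choice is an assumption

def speak_online(text: str, lang: str = "en") -> None:
    """Synthesize speech with Google TTS and play the resulting file."""
    gTTS(text=text, lang=lang).save("output.mp3")
    playsound("output.mp3")

speak_online("This is the summarized document content.")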

Figure 1. Flow chart of the methodology.

6. Software Requirements

The system is developed using Python and runs on Windows, though it can be executed on any modern operating system that supports Python and the necessary libraries. The core functionality involves several key Python libraries. For Optical Character Recognition (OCR), docTR, a deep learning-based OCR library that extracts text from scanned images or documents, is used as the primary tool [10]. If needed, Tesseract-OCR can also be used as an alternative for traditional OCR tasks. For converting the extracted text into speech, the system utilizes pyttsx3, an offline Text-to-Speech (TTS) library, though gTTS (Google Text-to-Speech) can be used for more natural-sounding voices when internet access is available [12].


The system also integrates a Google Translate library that interfaces with the Google Translate API, enabling text translation into multiple languages [9]. For text summarization, Hugging Face Transformers is used to implement AI-based models that generate concise summaries of long texts. Additionally, the keyboard library is employed to handle keyboard inputs, allowing users to trigger actions like scanning, translating, or summarizing. Development is carried out in PyCharm, a Python IDE whose robust features for managing and debugging code make for a productive development environment. Together, these libraries create a comprehensive and efficient framework for handling text recognition, translation, summarization, and speech synthesis, providing a flexible and scalable basis for automating and improving workflows that require text extraction, translation, summarization, and speech output.
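
A sketch of how the keyboard library can map key presses to the pipeline stages follows; the key bindings and handler bodies are illustrative placeholders for the stage functions sketched in Section 5, not the paper's actual wiring.

import keyboard

# Handlers stand in for the OCR, translation, and summarization stages.
def on_scan():      print("scan and read requested")
def on_translate(): print("translation requested")
def on_summarize(): print("summarization requested")

keyboard.add_hotkey("f1", on_scan)
keyboard.add_hotkey("f2", on_translate)
keyboard.add_hotkey("f3", on_summarize)
keyboard.wait("esc")   # keep listening until Esc is pressed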

7. Results

The system successfully implements document scanning to speech conversion and summarization of the scanned document (Figures 6 and 7), leveraging docTR for Optical Character Recognition (OCR), pyttsx3 for Text-to-Speech (TTS), and Transformers classes such as T5Tokenizer and T5ForConditionalGeneration for summarization [11]. The system can scan documents (both images and printed text), extract the content using docTR's OCR capabilities, and then convert the extracted text to speech with pyttsx3, which provides clear and intelligible audio output through the system's speakers. It also produces a short summary of the scanned document. This functionality makes the system highly accessible, particularly for blind users, as it allows them to listen to the content of printed or image-based documents.
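
A sketch of the T5-based summarization path using the classes named above follows; the t5-small checkpoint follows reference [11], while the generation parameters are illustrative.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def summarize_t5(text: str) -> str:
    """Summarize text with T5; the 'summarize:' prefix selects the task."""
    inputs = tokenizer("summarize: " + text, return_tensors="pt",
                       max_length=512, truncation=True)
    ids = model.generate(inputs["input_ids"], max_length=120, min_length=30,
                         num_beams=4, early_stopping=True)
    return tokenizer.decode(ids[0], skip_special_tokens=True)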

However, there are some challenges: OCR accuracy is generally good for printed text (Figures 2 and 3), but the system struggles with handwritten documents, where recognition accuracy is lower (Figures 4 and 5). Despite this limitation, the speech output generated by pyttsx3 is of high quality, ensuring a smooth and understandable user experience for text-to-speech conversion. Moving forward, improvements in handwritten text recognition or integration with additional OCR models could enhance the system's robustness in diverse use cases. For real-time operation, the system needs several future enhancements. First, optimizing OCR speed using GPU acceleration and image preprocessing is crucial for faster text extraction. The text-to-speech engine can also be upgraded to more advanced neural models for more natural and quicker voice output.


To minimize translation and summarization delays, local pre-trained models or efficient API
calls should be integrated. Additionally, using multithreading or asynchronous processing will
streamline the workflow, reducing overall latency. Upgrading hardware, such as faster
processors or adding a GPU, will further boost real-time performance. These improvements are
essential to achieve seamless real-time document-to-speech conversion.
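
As a sketch of the multithreading suggestion, summarization and translation could run concurrently once OCR has produced the text; summarize and translate_text refer to the earlier sketches, and the assumption that these two stages dominate the latency is ours, not the paper's measurement.

from concurrent.futures import ThreadPoolExecutor

def process_document(text: str, dest_lang: str) -> tuple[str, str]:
    """Run summarization and translation in parallel after OCR."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        summary = pool.submit(summarize, text)                      # Section 5.4 sketch
        translation = pool.submit(translate_text, text, dest_lang)  # Section 5.3 sketch
        return summary.result(), translation.result()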

Figure 2. Detection of a printed image.

Figure 3. Confidence percentage of the detection.


Figure 4. Detection of a handwritten document.

Figure 5. Confidence percentage of the detection.


Figure 6. Extracted text.

Figure 7. Generated summary.


8. Conclusion

This document-to-voice conversion project introduces an innovative solution to challenges that many existing tools overlook. By incorporating advanced features such as text summarization, language translation, and natural-sounding speech output, the tool goes beyond basic text-to-speech functionality to create a more comprehensive and inclusive experience. It reduces the cognitive load on users by providing summarized content, which is especially valuable for lengthy or complex documents. The integration of translation broadens accessibility further, enabling users to interact with content in multiple languages seamlessly.

This tool is particularly impactful for visually impaired users, who often encounter barriers in
accessing written materials, as well as for individuals facing language barriers, enhancing their
ability to access and understand information independently. By ensuring these additional
features operate with ease and affordability, the tool is designed to serve a wide range of users,
making it accessible to students, professionals, and anyone needing improved document
accessibility.

With this product, we aim to create a supportive, user-centered platform that promotes equal
access to information, regardless of visual or linguistic challenges. Ultimately, this tool fosters
a more inclusive environment by enabling all users to interact with information more
effectively and meaningfully, bridging gaps in accessibility and advancing the goal of a more
equitable information landscape.

References

[1] Singh, Anshika, and Sharvan Kumar Garg. "Comparative study of optical character
recognition using different techniques on scanned handwritten images." Micro-Electronics and
Telecommunication Engineering: Proceedings of 6th ICMETE 2022. Singapore: Springer
Nature Singapore, 2023. 411-420.

[2] Guravaiah, Koppala, et al. "Third eye: object recognition and speech generation for visually
impaired." Procedia Computer Science 218 (2023): 1144-1155.
[3] Batra, Pulkit, et al. "OCR-MRD: performance analysis of different optical character
recognition engines for medical report digitization." International Journal of Information
Technology 16.1 (2024): 447-455.
[4] Manju, S., and J. Anitha. "Investigation of Handwritten Image-To-Speech Using Deep
Learning." 2024 International Conference on Advances in Modern Age Technologies for
Health and Engineering Science (AMATHE). IEEE, 2024.
[5] Gupta, Anushka, et al. "Automated news summarization using transformers." Sustainable
Advanced Computing: Select Proceedings of ICSAC 2021. Singapore: Springer Singapore,
2022. 249-259.
[6] Bauboorally, S. M. W., and S. Pudaruth. "A Statistical and Machine Learning Approach for Summarising Computer Science Research Papers." International Journal of Computing and Digital Systems (2023).
[7] Adhikari, Surabhi. "NLP based machine learning approaches for text summarization." 2020
Fourth International Conference on Computing Methodologies and Communication
(ICCMC). IEEE, 2020.
[8] Vieira, Lucas Nunes, et al. "Machine translation in society: insights from UK users."
Language Resources and Evaluation 57.2 (2023): 893-914.
[9] Kolhar, Manjur, and Abdalla Alameen. "Artificial Intelligence Based Language Translation
Platform." Intelligent Automation & Soft Computing 28.1 (2021).
[10] Porwal, Utkarsh, Alicia Fornés, and Faisal Shafait. "Advances in handwriting recognition."
International Journal on Document Analysis and Recognition (IJDAR) 25.4 (2022): 241-243.
[11] Raj, Ankit, et al. "Document-Based Text Summarization using T5 small and gTTS." 2024
International Conference on Advances in Data Engineering and Intelligent Computing Systems
(ADICS). IEEE, 2024.
[12] Sisman, Berrak, et al. "An overview of voice conversion and its challenges: From statistical
modeling to deep learning." IEEE/ACM Transactions on Audio, Speech, and Language
Processing 29 (2020): 132-157.
