
AUDIO TO TEXT & TEXT TO AUDIO CONVERTER FOR

USER AT THE WEBSERVER


A

Major Project Report

Submitted in partial fulfillment of the requirement for the award of Degree of

BACHELOR OF TECHNOLOGY

In

COMPUTER SCIENCE & ENGINEERING

Submitted to

Dr. APJ Abdul Kalam Technical University


LUCKNOW (U.P.)

Submitted by

Shivam (1838310042)
Adarsh Kumar Tiwari (1838310003)

Under the Supervision of

Mr. Anubhava Srivastava


Assistant Professor
Computer Science & Engineering Department

Rajarshi Rananjay Sinh Institute of Management & Technology


Session 2021-22
Rajarshi Rananjay Sinh Institute of Management &

Technology
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE
This is to certify that Shivam (1838310042) and Adarsh Kumar Tiwari (1838310003) of B.Tech. final year, Computer Science & Engineering, have completed their major project entitled "Audio to Text and Text to Audio Converter" during the year 2021-2022 under my guidance and supervision.

I approve the project for submission in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in Computer Science & Engineering.

Approved by:

Mr. Anubhav Srivastava
Assistant Professor
CSE & IT Department
RRSIMT, Amethi

(Ajay Misra)
Head
CSE & IT Department
E-mail: [email protected]
RRSIMT, Amethi
Rajarshi Rananjay Sinh Institute of Management &

Technology

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

DECLARATION BY CANDIDATE

We, Shivam (1838310042) and Adarsh Kumar Tiwari (1838310003), students of Bachelor of Technology, Computer Science & Engineering, Rajarshi Rananjay Sinh Institute of Management & Technology, Amethi, hereby declare that the work presented in this major project entitled "Audext" is the outcome of our own work, is bona fide, and is correct to the best of our knowledge, and that this work has been carried out with due regard for engineering ethics. The work presented does not infringe any patented work and has not been submitted to any university for the award of any degree or professional diploma.

Shivam (1838310042)
Adarsh Kumar Tiwari (1838310003)
PREFACE
I have prepared this report on AUDEXT. I have tried my best to elucidate all the details relevant to the topic included in the report, beginning with a general view of the subject. My efforts and the wholehearted cooperation of everyone involved have ended on a successful note. I express my sincere gratitude to Mr. Anurag Singh (Assistant Professor, CS & IT), who assisted me throughout the preparation of this report. I thank him for providing me with reinforcement, confidence, and, most importantly, direction on the topic whenever I needed it.
ACKNOWLEDGEMENT

I would like to thank the respected Mr. Ajay Misra (HOD, CS & IT Dept.) and Mr. Anurag Singh (Assistant Professor, CS & IT) for giving me such a wonderful opportunity to expand my knowledge of my own branch and for giving me guidelines to build a project report. It helped me a lot to realize what we study for. Secondly, I would like to thank my parents, who patiently helped me as I went through my work and helped to modify and eliminate some of the irrelevant or unnecessary material. Thirdly, I would like to thank my friends, who helped me make my work more organized and well-structured till the end. Next, I would thank Microsoft for developing such a wonderful tool as MS Word; it helped my work a lot to remain error-free. Last but not least, I thank the Almighty for giving me the strength to complete my report on time.

Shivam
(1838310042)

Adarsh Kumar Tiwari


(1838310003)
TABLE OF CONTENTS

1. Introduction

2. Statement of Problem

3. Why Particular Choice

4. Objectives of Project

5. Methodology

6. Process Description

7. Facilities Required for Project

8. Testing Technology & Security Mechanism

9. Contribution of Your Project

10. Limitation

11. Conclusion

12. Screenshots of Project

13. Source Code

14. References
Abstract

The project report titled "AUDEXT" describes a text-to-audio and audio-to-text converter for users at the web server. The website "AUDEXT" is developed in HTML, CSS, and Bootstrap and mainly provides basic operations such as user login, a contact page, converter pages, and a social media page. "AUDEXT" is a web application written for 64-bit Windows operating systems, designed to help users convert text into audio and audio into text. The application is easy to use for physically disabled people who face problems in communication.

A text-to-audio converter is an application that converts text into spoken words by analyzing and processing the text using Natural Language Processing (NLP) and then using Digital Signal Processing (DSP) technology to convert the processed text into a synthesized speech representation. Here, we developed a useful text-to-speech synthesizer in the form of a simple application that converts input text into synthesized speech, reads it out to the user, and can save the result as an .mp3 file. The development of a text-to-speech synthesizer will be of great help to people with visual impairment and will make working through large volumes of text easier.

An audio-to-text converter is an application that lets the user control computer functions and dictate text by voice. The system consists of two components: the first processes the acoustic signal captured by a microphone, and the second interprets the processed signal and maps it to words. Home automation can likewise be based entirely on such a voice recognition system.
Introduction

AUDIO-TO-TEXT

Voice is the most basic, common, and efficient form of communication for people interacting with each other. Today, speech technologies are commonly available for a limited but interesting range of tasks. These technologies enable machines to respond correctly and reliably to human voices and to provide useful and valuable services. Since communicating with a computer by voice is faster than typing on a keyboard, people tend to prefer such systems. Communication among human beings is dominated by spoken language; therefore it is natural for people to expect voice interfaces to computers. This can be accomplished by developing a voice recognition (speech-to-text) system, which allows a computer to translate voice requests and dictation into text. Speech-to-text conversion is the process of converting an acoustic signal, captured by a microphone, into a set of words. The recognized data can then be used for document preparation.
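The report does not state which speech-to-text engine AUDEXT uses. Purely as an illustrative sketch, the capture-and-recognize flow described above could look as follows in Python with the SpeechRecognition package; the package choice and the input file name are assumptions, not the project's documented code.

# Illustrative speech-to-text sketch; the library and input file are assumptions.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("dictation.wav") as source:   # acoustic signal captured earlier by a microphone
    audio = recognizer.record(source)

# Interpret the processed signal and map it to words using a cloud recognizer.
text = recognizer.recognize_google(audio, language="en-US")
print(text)                                     # recognized text, ready for document preparation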

TEXT-TO-AUDIO

Text-to-speech synthesis (TTS) is the automatic conversion of a text into speech that resembles, as closely as possible, a native speaker of the language reading that text. A text-to-speech synthesizer is the technology which lets computers speak to you. The TTS system gets text as its input, and a computer algorithm called the TTS engine analyzes the text, pre-processes it, and synthesizes the speech with some mathematical models. The TTS engine usually generates sound data in an audio format as the output. The text-to-speech (TTS) synthesis procedure consists of two main phases. The first is text analysis, where the input text is transcribed into a phonetic or some other linguistic representation, and the second is the generation of speech waveforms, where the output is produced from this phonetic and prosodic information. These two phases are usually called high-level and low-level synthesis [1]. A simplified version of this procedure is presented in Figure 1. The input text might be, for example, data from a word processor, standard ASCII from e-mail, a mobile text message, or scanned text from a newspaper. The character string is then pre-processed and analyzed into a phonetic representation, which is usually a string of phonemes with some additional information for correct intonation, duration, and stress. Speech sound is finally generated with the low-level synthesizer from the information provided by the high-level one. The artificial production of speech-like sounds has a long history, with documented mechanical attempts dating to the eighteenth century.
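As a hedged illustration of this two-phase pipeline (the report does not name the engine AUDEXT actually uses), an off-the-shelf library such as gTTS can perform the high- and low-level synthesis internally and write the result to an .mp3 file, as mentioned in the abstract; the library choice and file names are assumptions.

# Minimal text-to-audio sketch; gTTS is an assumed library, not necessarily the project's engine.
from gtts import gTTS

text = "Welcome to AUDEXT, the text to audio and audio to text converter."
tts = gTTS(text=text, lang="en")   # text analysis and waveform generation happen inside the engine
tts.save("welcome.mp3")            # synthesized speech saved as an .mp3 file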
OVERVIEW

1. Overview of Voice Recognition System: Audio-to-Text


Input signal: voice input provided by the user.

Feature extraction: it should retain the useful information of the signal, remove redundant and unwanted information, show little variation from one speaking environment to another, and occur normally and naturally in speech.

Acoustic model: it contains statistical representations of each distinct sound that makes up a word.

Decoder: it decodes the input signal after feature extraction and produces the desired output.

Language model: it assigns a probability to a sequence of words by means of a probability distribution.

Output: the interpreted text is returned by the computer.

The main aim of the project is to recognize speech using MFCC and VQ techniques. Feature extraction will be done using Mel Frequency Cepstral Coefficients (MFCC). The steps of MFCC are as follows:

1) Framing and Blocking

2) Windowing

3) FFT (Fast Fourier Transform)

4) Mel-Scale

5) Discrete Cosine Transform (DCT)

Feature matching will be done using the Vector Quantization (VQ) technique. The steps are as follows:

1) Any two dimensions are chosen, the vectors are inspected, and the data points are plotted.

2) The data regions for two different speakers are observed to check whether they overlap or fall in the same cluster.

3) The function vqlbg trains the VQ codebook using the LBG algorithm.


The extracted MFCC features will be stored in a .mat file. Models will be created using a Hidden Markov Model (HMM), and the desired output will be shown in the MATLAB interface. An illustrative sketch of the feature extraction step follows.
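The report describes a MATLAB implementation; purely as a hedged illustration, the same MFCC steps (framing and blocking, windowing, FFT, Mel scale, DCT) can be reproduced in Python with the librosa library and the features stored in a .mat file. The file name, sampling rate, and frame sizes below are assumptions.

# Illustrative MFCC feature extraction mirroring the steps listed above.
import librosa
from scipy.io import savemat

y, sr = librosa.load("speaker01_digit3.wav", sr=16000)   # hypothetical recording
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,        # cepstral coefficients kept after the DCT
    n_fft=400,        # 25 ms analysis frames (framing and blocking, windowing, FFT)
    hop_length=160,   # 10 ms frame shift
)
savemat("features.mat", {"mfcc": mfcc})                   # store the features in a .mat file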
2. OVERVIEW OF TEXT-TO-AUDIO SYNTHESIS
Speech synthesis can be described as artificial production of human speech. A
computer system used for this purpose is called a speech synthesizer, and can be
implemented in software or hardware. A text-to-audio (TTA) system converts
normal language text into speech. Synthesized speech can be created by
concatenating pieces of recorded speech that are stored in a database. Systems
differ in the size of the stored speech units; a system that stores phones or
diphones provides the largest output range, but may lack clarity. For specific
usage domains, the storage of entire words or sentences allows for high-quality
output. Alternatively, a synthesizer can incorporate a model of the vocal tract
and other human voice characteristics to create a completely "synthetic" voice
output.

The quality of a speech synthesizer is judged by its similarity to the human


voice and by its ability to be understood. An intelligible text-to-speech program
allows people with visual impairments or reading disabilities to listen to written
works on a home computer. A text-to-audio system (or "engine") is composed
of two parts: a front-end and a back-end. The front-end has two major tasks.
First, it converts raw text containing symbols like numbers and abbreviations
into the equivalent of written-out words. This process is often called text
normalization, preprocessing, or tokenization. The front-end then assigns
phonetic transcriptions to each word, and divides and marks the text into
prosodic units, like phrases, clauses, and sentences. The process of assigning
phonetic transcriptions to words is called text-to-phoneme or grapheme-to-
phoneme conversion. Phonetic transcriptions and prosody information together
make up the symbolic linguistic representation that is output by the front-end.
The back-end—often referred to as the synthesizer—then converts the symbolic
linguistic representation into sound. In certain systems, this part includes the
computation of the target prosody (pitch contour, phoneme durations), which is
then imposed on the output speech.

There are different ways to perform speech synthesis. The choice depends on the task for which the system is used, but the most widely used method is concatenative synthesis, because it generally produces the most natural-sounding synthesized speech. Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. There are three major sub-types of concatenative synthesis:
Domain-specific Synthesis: Domain-specific synthesis concatenates pre-
recorded words and phrases to create complete utterances. It is used in
applications where the variety of texts the system will output is limited to a
particular domain, like transit schedule announcements or weather reports. The
technology is very simple to implement, and has been in commercial use for a
long time, in devices like talking clocks and calculators. The level of
naturalness of these systems can be very high because the variety of sentence
types is limited, and they closely match the prosody and intonation of the
original recordings. Because these systems are limited by the words and phrases
in their databases, they are not general-purpose and can only synthesize the
combinations of words and phrases with which they have been pre-
programmed. The blending of words within naturally spoken language however
can still cause problems unless many variations are taken into account. For
example, in nonrhotic dialects of English the "r" in words like "clear" /ˈklɪə/ is
usually only pronounced when the following word has a vowel as its first letter
(e.g. "clear out" is realized as /ˌklɪəɾˈʌʊt/). Likewise in French, many final
consonants become no longer silent if followed by a word that begins with a
vowel, an effect called liaison. This alternation cannot be reproduced by a
simple word-concatenation system, which would require additional complexity
to be context-sensitive. This involves recording the voice of a person speaking the desired words and phrases. It is useful when only a restricted set of phrases and sentences is needed and the variety of texts the system will output is limited to a particular domain, e.g. announcements in a train station, weather reports, or checking a telephone subscriber's account balance.

Unit Selection Synthesis: Unit selection synthesis uses large databases of


recorded speech. During database creation, each recorded utterance is
segmented into some or all of the following: individual phones, diphones , half-
phones, syllables, morphemes, words, phrases, and sentences. Typically, the
division into segments is done using a specially modified speech recognizer set
to a "forced alignment" mode with some manual correction afterward, using
visual representations such as the waveform and spectrogram. An index of the
units in the speech database is then created based on the segmentation and
acoustic parameters like the fundamental frequency (pitch), duration, position in
the syllable, and neighboring phones. At runtime, the desired target utterance is
created by determining the best chain of candidate units from the database (unit
selection). This process is typically achieved using a specially weighted
decision tree.
Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. Also, unit selection algorithms have been known to select segments from a place that results in less-than-ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.

Diphone Synthesis: Diphone synthesis uses a minimal speech database


containing all the diphones (sound-to-sound transitions) occurring in a
language. The number of diphones depends on the phonotactics of the language:
for example, Spanish has about 800 diphones, and German about 2500. In
diphone synthesis, only one example of each diphone is contained in the speech
database. At runtime, the target prosody of a sentence is superimposed on these
minimal units by means of digital signal processing techniques such as linear
predictive coding, PSOLA or MBROLA. The quality of the resulting speech is
generally worse than that of unit-selection systems, but more natural-sounding
than the output of formant synthesizers. Diphone synthesis suffers from the
sonic glitches of concatenative synthesis and the robotic-sounding nature of
formant synthesis, and has few of the advantages of either approach other than
small size. As such, its use in commercial applications is declining, although it
continues to be used in research because there are a number of freely available
software implementations.
Structure of A Text-To-Speech System

Text-to-speech conversion takes place in several steps. The TTS system gets a text as input, which it must first analyze and then transform into a phonetic description. In a further step it generates the prosody. From the information now available, it can produce a speech signal. The structure of the text-to-speech synthesizer can be broken down into the following major modules:

 Natural Language Processing (NLP) module: It produces a phonetic


transcription of the text read, together with prosody.

 Digital Signal Processing (DSP) module: It transforms the symbolic


information it receives from the NLP module into audible and intelligible speech.

The major operations of the NLP module are as follows:

 Text Analysis: First the text is segmented into tokens. The token-to-word
conversion creates the orthographic form of the token. For the token “Mr” the
orthographic form “Mister” is formed by expansion, the token “12” gets the
orthographic form “twelve” and “1997” is transformed to “nineteen ninety
seven”.
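The report does not show its normalization code; the token-to-word conversion just described can be sketched as follows. The abbreviation table and number expansion below are illustrative assumptions, not the project's actual rules.

# Minimal sketch of token-to-word conversion (text normalization); tables are assumptions.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}
ABBREVIATIONS = {"Mr": "Mister", "Dr": "Doctor", "St": "Street"}

def two_digits(n: int) -> str:
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def expand_token(token: str) -> str:
    """Return the orthographic (written-out) form of a single token."""
    if token in ABBREVIATIONS:                  # "Mr"   -> "Mister"
        return ABBREVIATIONS[token]
    if token.isdigit() and len(token) == 4:     # "1997" -> "nineteen ninety seven"
        return two_digits(int(token[:2])) + " " + two_digits(int(token[2:]))
    if token.isdigit() and int(token) < 100:    # "12"   -> "twelve"
        return two_digits(int(token))
    return token

print(expand_token("Mr"), expand_token("12"), expand_token("1997"))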

 Application of Pronunciation Rules: After the text analysis has been


completed, pronunciation rules can be applied. Letters cannot be transformed into phonemes one-to-one, because the correspondence is not always parallel. In certain environments, a single letter can correspond to no phoneme at all (for example, "h" in "caught") or to several phonemes ("x" in "maximum"). In addition, several letters can correspond to a single phoneme ("ch" in "rich"). There are two strategies to determine pronunciation:

 In dictionary-based solution with morphological components, as many


morphemes (words) as possible are stored in a dictionary. Full forms are generated by means of inflection, derivation, and composition rules. Alternatively, a full-form dictionary is used in which all possible word forms are stored. Pronunciation rules determine the pronunciation of words not found in the dictionary.

 In a rule based solution, pronunciation rules are generated from the


phonological knowledge of dictionaries. Only words whose pronunciation is a
complete exception are included in the dictionary.
The two approaches differ significantly in the size of their dictionaries. The dictionary-based solution's dictionary is many times larger than the rule-based solution's dictionary of exceptions. However, dictionary-based solutions can be more exact than rule-based solutions if a large enough phonetic dictionary is available.
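A hedged sketch of the combined strategy, a dictionary lookup with a rule-based fallback, is given below; the tiny lexicon and letter-to-phoneme rules are illustrative assumptions rather than a real pronunciation dictionary.

# Dictionary-based grapheme-to-phoneme lookup with a simple rule-based fallback.
LEXICON = {
    "rich":   ["R", "IH", "CH"],   # assumed entries, not a real dictionary
    "caught": ["K", "AO", "T"],
}
LETTER_RULES = {"ch": ["CH"], "sh": ["SH"], "a": ["AE"], "b": ["B"], "c": ["K"],
                "e": ["EH"], "h": [], "i": ["IH"], "r": ["R"], "t": ["T"]}

def to_phonemes(word: str) -> list[str]:
    word = word.lower()
    if word in LEXICON:                        # exact dictionary hit
        return LEXICON[word]
    phonemes, i = [], 0
    while i < len(word):                       # fall back to letter-to-phoneme rules
        if word[i:i + 2] in LETTER_RULES:      # digraphs first ("ch", "sh")
            phonemes += LETTER_RULES[word[i:i + 2]]
            i += 2
        else:
            phonemes += LETTER_RULES.get(word[i], [word[i].upper()])
            i += 1
    return phonemes

print(to_phonemes("rich"))   # found in the dictionary
print(to_phonemes("chat"))   # produced by the fallback rules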

 Prosody Generation: after the pronunciation has been determined, the


prosody is generated. The degree of naturalness of a TTS system depends on prosodic factors like intonation modelling (phrasing and accentuation), amplitude modelling, and duration modelling (including the duration of sounds and the duration of pauses, which determines the length of the syllables and the tempo of the speech).

The output of the NLP module is passed to the DSP module. This is where the actual synthesis of the speech signal happens. In concatenative synthesis, the selection and linking of speech segments take place. For each individual sound, the best option (where several appropriate options are available) is selected from a database and concatenated.
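A hedged sketch of this concatenation step is shown below, assuming pre-recorded unit waveforms stored as WAV files with a shared sample rate; the unit file names and the cross-fade length are assumptions.

# Concatenative back-end sketch: join pre-recorded speech segments with a short cross-fade.
import numpy as np
from scipy.io import wavfile

unit_files = ["units/good.wav", "units/morning.wav"]   # hypothetical database entries

segments = []
sample_rate = None
for path in unit_files:
    rate, samples = wavfile.read(path)
    sample_rate = sample_rate or rate
    segments.append(samples.astype(np.float32))

fade = int(0.010 * sample_rate)          # 10 ms linear cross-fade at each join
out = segments[0]
for seg in segments[1:]:
    ramp = np.linspace(0.0, 1.0, fade, dtype=np.float32)
    out[-fade:] = out[-fade:] * (1.0 - ramp) + seg[:fade] * ramp
    out = np.concatenate([out, seg[fade:]])

wavfile.write("utterance.wav", sample_rate, out.astype(np.int16))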
Classification of speech recognition systems

Speech recognition systems can be classified into several different types according to the type of speech utterance, the type of speaker model, and the type of vocabulary that they have the ability to recognize.

These categories are briefly explained below:

A. Types of speech utterance

Speech recognition systems are classified according to the type of utterance they are able to recognize. They are classified as:
1) Isolated words: An isolated-word recognizer usually requires each spoken word to have quiet (lack of an audio signal) on both sides of the sample window. It accepts a single word at a time.
2) Connected words: This is similar to isolated words, but it allows separate utterances to 'run together' with only a minimal pause between them.
3) Continuous speech: It allows users to speak naturally while the computer determines the content in parallel.
4) Spontaneous speech: This is speech that is natural sounding and not rehearsed.

B. Types of speaker model

Speech recognition systems fall broadly into two main categories based on the speaker model, namely speaker dependent and speaker independent.

1) Speaker-dependent models: These systems are designed for a specific speaker. They are easier to develop and more accurate, but they are not as flexible.

2) Speaker-independent models: These systems are designed for a variety of speakers. They are more difficult to develop and less accurate, but they are much more flexible.

C. Types of vocabulary

The vocabulary size of a speech recognition system affects the processing requirements, accuracy, and complexity of the system. In a speech-to-text voice recognition system, vocabularies can be classified as follows:
1) Small vocabulary: single letters.
2) Medium vocabulary: two- or three-letter words.
3) Large vocabulary: words with more letters.
Survey of research papers

Kuldip K. Paliwal et al. (2004) discussed that, despite their popularity as front-end parameters in speech recognition, the cepstral coefficients obtained from linear prediction analysis are sensitive to noise. They discussed the use of spectral subband centroids for robust speech recognition and argued that recognition performance comparable to MFCC can be achieved if the centroids are selected properly. They also proposed a procedure to construct a dynamic centroid feature vector that essentially includes transitional spectral information.

Esfandier Zavarehei et al. (2005) studied a time-frequency estimator for the enhancement of noisy speech signals in the DFT domain. It is based on a low-order autoregressive process used for modelling the time-varying trajectory of the DFT components of speech, which is formed into a Kalman filter state equation. A method was devised to restart the Kalman filter at the onsets of speech. The performance of this method was compared with parametric spectral subtraction and the MMSE estimator for the enhancement of noisy speech. The result of the proposed method is that residual noise is reduced and speech quality is improved using Kalman filters.

Ibrahim Patel et al. (2010) discussed an approach to speech recognition in which frequency spectral information is combined with the conventional Mel frequency spectrum, within an HMM-based recognition framework. The Mel-frequency approach utilizes frequency observations of speech within a given resolution, which results in overlapping resolution features and limits recognition. In the HMM-based speech recognition system, resolution decomposition is used together with a separating-frequency mapping approach. The result of the study is an improvement in the quality metrics of speech recognition with respect to computational time and learning accuracy.
Kavita Sharma and Prateek Hakar (2012) presented speech recognition as a broader solution. It refers to technology that can recognize speech without being targeted at a single speaker. Variability in the speech pattern is the main problem in speech recognition. Speaker characteristics, which include accent, noise, and co-articulation, are the most challenging sources of variation in speech. In a speech recognition system, the function of the basilar membrane is mimicked in the front-end filter bank. It is believed that band subdivision closer to human perception gives better recognition results. The filter constructed for speech recognition is estimated from noisy and clean speech.

Puneet Kaur, Bhupender Singh, and Neha Kapur (2012) discussed how to use the Hidden Markov Model in the process of speech recognition. The three essential steps needed to develop an ASR (Automatic Speech Recognition) system are pre-processing, feature extraction, and recognition, and finally a hidden Markov model is used to obtain the desired result. Researchers are continuously trying to develop a perfect ASR system; although there have already been huge advancements in the field of digital signal processing, computer performance in this field is still limited in terms of response speed and matching accuracy. The three different techniques used by researchers are the acoustic-phonetic approach, the pattern recognition approach, and the knowledge-based approach [4].

Chadawan Ittichaichareon and Patiyuth Pramkeaw (2012) discussed the use of the signal processing toolbox to implement a low-pass filter with a finite impulse response. The computational implementation and analytical design of the FIR filter were accomplished by evaluating performance at different signal-to-noise ratio levels. Recognition results improve when low-pass filtering is used compared with processing the speech signal without filtering.

Geeta Nijhawan, Poonam Pandit, and Shivanker Dev Dhingra (2013) discussed the techniques of dynamic time warping (DTW) and Mel-scale frequency cepstral coefficients for isolated speech recognition. Different features of the spoken word were extracted from the input speech. A sample of 5 speakers was collected, each speaking 10 digits, and a database was built on this basis. Features were then extracted using MFCC. DTW is used to deal effectively with varying speaking speeds; it measures the similarity between two sequences which vary in speed and time, as sketched below.
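The surveyed paper's code is not reproduced in this report; purely as an illustration of the DTW idea, the alignment cost between two feature sequences of different lengths can be computed as follows (the toy sequences are assumptions).

# Illustrative dynamic time warping (DTW) distance between two feature sequences.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (n, d) and b: (m, d) feature sequences, e.g. per-frame MFCC vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])    # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return float(cost[n, m])

# Two toy sequences "spoken" at different speeds still align with zero cost.
slow = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])
fast = np.array([[0.0], [1.0], [2.0]])
print(dtw_distance(slow, fast))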
Why this project?

Speech recognition technology is one of the fastest growing engineering technologies.

Nearly 20% of the world's population suffers from various disabilities; many of them are blind or unable to use their hands effectively. They can share information with other people by operating a computer through voice input.

Our project is capable of recognizing speech and converting the input audio into text and text into audio.

This project can solve many disability- and communication-related problems.
FUTURE SCOPE

• Accuracy will become better and better.

• Dictation speech recognition will gradually become accepted.

• Greater use will be made of "intelligent systems" which will attempt to guess what the speaker intended to say rather than what was actually said, as people often misspeak and make unintentional mistakes.

• Microphones and sound systems will be designed to adapt more quickly to changing background noise levels and different environments, with better recognition of extraneous material to be discarded.
WEAKNESS

Homonyms:

These are words that are spelled differently and have different meanings but sound the same, for example "there" and "their", or "be" and "bee". It is a challenge for a machine to distinguish between such phrases that sound alike.

Simultaneous speech:

A second challenge in the process is understanding speech uttered by different users; current systems have difficulty separating simultaneous speech from multiple users.

Noise factor:

The program requires hearing the words uttered by a human distinctly and clearly. Any extra sound can create interference: the system must first be placed away from noisy environments, and the user must speak clearly, otherwise the machine will get confused and mix up the words.
APPLICATIONS

 In Car Systems

 Health Care

 Military

 Training air traffic controllers

 Telephony and other domains

 Usage in education and daily life


Accuracy

Accuracy of speech recognition varies with the following:

 Vocabulary size and confusability

 Speaker dependence vs. independence

 Isolated, discontinuous, or continuous speech

 Task and language constraints

 Read vs. spontaneous speech


Objectives of Project
TEXT-TO-AUDIO

 Text-to-audio (TTA) technology reads aloud digital text — the words on


computers, smartphones, and tablets.

 TTA can help people who struggle with reading.

 There are TTA tools available for nearly every digital device.

Text-to-audio (TTA) is a type of assistive technology that reads digital text


aloud. It’s sometimes called “read aloud” technology.

With a click of a button or the touch of a finger, TTA can take words on a
computer or other digital device and convert them into audio. TTA is very
helpful for kids and adults who struggle with reading. But it can also help with
writing and editing, and even with focusing.

TTA works with nearly every personal digital device, including computers,
smartphones, and tablets. All kinds of text files can be read aloud, including
Word and Pages documents. Even online web pages can be read aloud.
Benefits of AUDEXT

Benefits of audio to text


1. Ease of communication – No more illegible handwriting

2. Quick document turnaround

3. Flexibility to work in or out of the office

4. Time saved with increased efficiency and less paperwork

5. Tedious jobs can be streamlined and simplified

6. Speech recognition software can produce documents in less than half the time it takes to type

7. Multitasking – dictation on the go

8. Flexibility to share files across devices

9. Fewer errors – provides an accurate and reliable method of documentation

10. Secure pathways for information transmission

11. Accessible from your iPhone, Android or tablet

12. Workflow visibility – enabling easier management of priorities and turnarounds.


Benefits of text to audio

TTA Benefits for Businesses, Organizations, and Publishers

1. Enhanced customer experience

2. Effective branding across touchpoints


3. Global market penetration

4. Optimized development and maintenance

5. More autonomy for the digital content owner

6. Increased web presence


7. Saved time and money

8. Easier implementation with Internet of Things (IoT)

9. Word-of-mouth marketing

10. Enhanced employee performance with corporate learning programs

TTA Benefits for End Users

 Extend the reach of your content

 Accessibility is relevant

 Populations are evolving


 A growing elderly population depends on technology

 People are increasingly mobile and looking for convenience

 People with different learning styles


CONCLUSION

Text-to-audio and audio-to-text systems are a rapidly growing aspect of computer technology and play an increasingly important role in the way we interact with systems and interfaces across a variety of platforms. We have identified the various operations and processes involved in text-to-speech synthesis. We have also developed a very simple and attractive graphical user interface which allows the user to type his/her text into the text field provided in the application. Our system interfaces with a text-to-speech engine developed for American English. In future, we plan to make efforts to create engines for localized Nigerian languages so as to make text-to-speech technology more accessible to a wider range of Nigerians. This already exists for some native languages, e.g. Swahili, Konkani, the Vietnamese synthesis system, and the Telugu language. Another area of further work is the implementation of a text-to-speech system on other platforms, such as telephony systems, ATMs, video games, and any other platform where text-to-speech technology would be an added advantage and increase functionality.
Screenshots of Project

The report includes screenshots of the following pages: Home Page, Contact Page, Social Page, Audio to Text Converter, and Text to Audio Converter.

References
1. Lemmetty, S., 1999. Review of Speech Synthesis Technology. Master's Dissertation, Helsinki University of Technology.
2. Dutoit, T., 1993. High Quality Text-to-Speech Synthesis of the French Language. Doctoral dissertation, Faculte Polytechnique de Mons.
3. Suendermann, D., Höge, H., and Black, A., 2010. Challenges in Speech Synthesis. In: Chen, F., Jokinen, K. (eds.), Speech Technology. Springer Science + Business Media LLC.
4. Allen, J., Hunnicutt, M. S., and Klatt, D., 1987. From Text to Speech: The MITalk System. Cambridge University Press.
5. Rubin, P., Baer, T., and Mermelstein, P., 1981. An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America 70: 321–328.
6. van Santen, J. P. H., Sproat, R. W., Olive, J. P., and Hirschberg, J., 1997. Progress in Speech Synthesis. Springer.
7. van Santen, J. P. H., 1994. Assignment of segmental duration in text-to-speech synthesis. Computer Speech & Language, Volume 8, Issue 2, pp. 95–128.
8. Wasala, A., Weerasinghe, R., and Gamage, K., 2006. Sinhala Grapheme-to-Phoneme Conversion and Rules for Schwa Epenthesis. Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, Sydney, Australia, pp. 890–897.
