Audext Project
BACHELOR OF TECHNOLOGY
in Computer Science & Engineering
Submitted by:
Shivam (1838310042)
Adarsh Kumar Tiwari (1838310003)
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
CERTIFICATE
This is to certify that Shivam (1838310042) and Adarsh Kumar Tiwari
(1838310003) of B.Tech. final year, Computer Science & Engineering, have
completed their major project entitled "Audio to Text and Text to Audio
Converter" during the year 2021-2022 under my guidance and supervision.
I approve the project for submission in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in
Computer Science & Engineering.
Approved by:
DECLARATION BY CANDIDATE
Shivam (1838310042)
Adarsh Kumar Tiwari (1838310003)
PREFACE
I have made this report file on AUDEXT. I have tried my best to elucidate all the details
relevant to the topic included in the report, beginning with a general view of the topic.
My efforts and the wholehearted cooperation of everyone involved have ended on a
successful note. I express my sincere gratitude to Ms. Anurag Singh (Assistant Professor,
CS & IT), who assisted me throughout the preparation of this topic. I thank her for
providing me the reinforcement, the confidence, and, most importantly, the direction for
the topic whenever I needed it.
ACKNOWLEDGEMENT
I would like to thank the respected Mr. Ajay Misra (HOD, CS & IT Dept.) and Mr. Anurag
Singh (Assistant Professor, CS & IT) for giving me such a wonderful opportunity to expand
my knowledge of my own branch and for giving me guidelines to build a project report. It
helped me a lot to realize the value of what we study. Secondly, I would like to thank my
parents, who patiently helped me as I went through my work and helped me modify and
eliminate some of the irrelevant or unnecessary material. Thirdly, I would like to thank my
friends, who helped me keep my work organized and well-stacked till the end. Next, I would
like to thank Microsoft for developing such a wonderful tool as MS Word, which helped my
work remain error-free. Last but certainly not least, I thank the Almighty for giving me the
strength to complete my report on time.
Shivam
(1838310042)
Contents
2. Statement of Problem
4. Objectives
5. Methodology
6. Process Description
10. Limitation
11. Conclusion
14. References
Abstract
AUDIO-TO-TEXT
Voice is the most basic, common, and efficient form of communication for
people to interact with each other. Today, speech technologies are commonly
available for a limited but interesting range of tasks. These technologies
enable machines to respond correctly and reliably to human voices and to
provide useful and valuable services. Since communicating with a computer by
voice is faster than typing on a keyboard, people will prefer such systems.
Communication among human beings is dominated by spoken language; it is
therefore natural for people to expect voice interfaces to computers. This can
be accomplished by developing a voice recognition (speech-to-text) system,
which allows a computer to translate voice requests and dictation into text.
Speech-to-text is the process of converting an acoustic signal, captured using
a microphone, into a set of words. The recorded data can be used for document
preparation.
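The microphone-to-words pipeline described above can be sketched in a few lines of Python. The snippet below is a minimal illustration using the open-source SpeechRecognition library and its Google Web Speech backend; the library choice is an assumption for demonstration, not a statement of how Audext itself is implemented.

```python
# Minimal speech-to-text sketch using the SpeechRecognition library
# (pip install SpeechRecognition pyaudio). Backend choice is illustrative.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:        # capture the acoustic signal
    print("Speak now...")
    audio = recognizer.listen(source)  # record until a pause is detected

try:
    # Convert the recorded signal into a set of words
    text = recognizer.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("Recognition service error:", e)
```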
TEXT-TO-AUDIO
Decoder: it decodes the input signal after feature extraction and produces
the desired output. Language model: it assigns a probability to a sequence of
words by means of a probability distribution.
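As a small illustration of what the language model does, the sketch below estimates bigram probabilities from a toy corpus and scores a word sequence. The corpus and the add-one smoothing are assumptions for demonstration only.

```python
# Toy bigram language model: assigns a probability to a word sequence.
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def sequence_probability(words):
    """P(w1..wn) approximated as a product of smoothed P(w_i | w_{i-1})."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
    return p

print(sequence_probability("the cat sat".split()))  # likely sequence: higher score
print(sequence_probability("sat the cat".split()))  # unlikely sequence: lower score
```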
The main aim of the project is to recognize speech using MFCC and VQ
techniques. The feature extraction will be done using Mel Frequency Cepstral
Coefficients (MFCC). The steps of MFCC are as follows:
1) Frame blocking
2) Windowing
3) FFT (Fast Fourier Transform)
4) Mel-scale filtering
5) Cepstrum (DCT)
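Assuming a Python toolchain, MFCC extraction along exactly these steps is available off the shelf. The sketch below uses librosa, which is an assumed library choice; the input file name is hypothetical.

```python
# MFCC feature extraction sketch using librosa (pip install librosa).
# librosa internally performs framing, windowing, FFT, Mel filtering, and DCT.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)        # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 coefficients per frame
print(mfcc.shape)  # (13, number_of_frames)
```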
Feature matching will be done using the Vector Quantization (VQ) technique.
Among its steps, observation is needed to check whether the data regions of
two different speakers overlap each other and lie in the same cluster.
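A vector quantization codebook of the kind described can be sketched with k-means clustering. The use of scikit-learn's KMeans below, and the codebook size of 16, are illustrative assumptions rather than the project's actual implementation.

```python
# Vector Quantization sketch: build a per-speaker codebook from MFCC vectors
# with k-means, then match an utterance by quantization distortion.
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(mfcc_vectors, codebook_size=16):
    """mfcc_vectors: (n_frames, n_coeffs) array for one speaker."""
    return KMeans(n_clusters=codebook_size, n_init=10).fit(mfcc_vectors)

def distortion(codebook, mfcc_vectors):
    """Average distance of vectors to their nearest codeword; the known
    speaker whose codebook gives the lowest distortion is the match."""
    distances = np.min(codebook.transform(mfcc_vectors), axis=1)
    return distances.mean()
```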
There are different ways to perform speech synthesis. The choice depends on
the task they are used for, but the most widely used method is Concatenative
Synthesis, because it generally produces the most natural-sounding synthesized
speech. Concatenative synthesis is based on the concatenation (or stringing
together) of segments of recorded speech. There are three major sub-types of
concatenative synthesis:
Domain-specific Synthesis: Domain-specific synthesis concatenates pre-
recorded words and phrases to create complete utterances. It is used in
applications where the variety of texts the system will output is limited to a
particular domain, like transit schedule announcements or weather reports. The
technology is very simple to implement, and has been in commercial use for a
long time, in devices like talking clocks and calculators. The level of
naturalness of these systems can be very high because the variety of sentence
types is limited, and they closely match the prosody and intonation of the
original recordings. Because these systems are limited by the words and phrases
in their databases, they are not general-purpose and can only synthesize the
combinations of words and phrases with which they have been pre-
programmed. The blending of words within naturally spoken language however
can still cause problems unless many variations are taken into account. For
example, in nonrhotic dialects of English the "r" in words like "clear" /ˈklɪə/ is
usually only pronounced when the following word has a vowel as its first letter
(e.g. "clear out" is realized as /ˌklɪəɾˈʌʊt/). Likewise in French, many final
consonants become no longer silent if followed by a word that begins with a
vowel, an effect called liaison. This alternation cannot be reproduced by a
simple word-concatenation system, which would require additional complexity
to be context-sensitive. This involves recording the voice of a person speaking
the desired words and phrases. This is useful if only the restricted volume of
phrases and sentences is used and the variety of texts the system will output is
limited to a particular domain e.g. a message in a train station, whether reports
or checking a telephone subscriber’s account balance. .
Text-to-speech takes place in several steps. A TTS system gets text as input,
which it must first analyze and then transform into a phonetic description. In
a further step it generates the prosody. From the information now available, it
can produce a speech signal. The structure of the text-to-speech synthesizer
can be broken down into major modules:
Text Analysis: First the text is segmented into tokens. The token-to-word
conversion creates the orthographic form of the token. For the token “Mr” the
orthographic form “Mister” is formed by expansion, the token “12” gets the
orthographic form “twelve” and “1997” is transformed to “nineteen ninety
seven”.
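The token-to-word expansion described above can be illustrated in a few lines. The sketch below assumes the num2words package for number expansion; the abbreviation table is a toy example, not the system's actual lexicon.

```python
# Token-to-word conversion sketch for the text-analysis stage.
# num2words (pip install num2words) handles numbers; the abbreviation
# dictionary is an assumed toy example.
from num2words import num2words

ABBREVIATIONS = {"Mr": "Mister", "Dr": "Doctor"}

def token_to_word(token):
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if token.isdigit():
        n = int(token)
        # read four-digit tokens like 1997 as years
        style = "year" if 1000 <= n <= 2099 else "cardinal"
        return num2words(n, to=style)
    return token

print(token_to_word("Mr"))    # Mister
print(token_to_word("12"))    # twelve
print(token_to_word("1997"))  # nineteen ninety-seven
```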
The output of the NLP module is passed to the DSP module. This is where the
actual synthesis of the speech signal happens. In concatenative synthesis, the
selection and linking of speech segments take place. For individual sounds,
the best option (where several appropriate options are available) is selected
from a database and concatenated.
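The concatenation step itself reduces to joining stored waveforms. The sketch below assumes a folder of pre-recorded unit files named after each word, all at the same sample rate, and uses the soundfile package; both assumptions are for illustration only.

```python
# Concatenative synthesis sketch: string together pre-recorded segments.
# Assumes a units/ folder with one WAV per word (hypothetical files).
import numpy as np
import soundfile as sf

def synthesize(words, unit_dir="units"):
    segments = []
    rate = None
    for word in words:
        data, rate = sf.read(f"{unit_dir}/{word}.wav")  # load recorded unit
        segments.append(data)
    sf.write("output.wav", np.concatenate(segments), rate)

synthesize(["clear", "out"])
```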
Classification of speech recognition systems
Speech recognition systems are classified according to the type of utterance
they are able to recognize:
1) Isolated word: An isolated-word recognizer usually requires each spoken
word to have quiet (lack of an audio signal) on both sides of the sample
window. It accepts a single word at a time.
2) Connected word: Similar to isolated word, but it allows separate
utterances to "run together" with a minimum pause between them.
3) Continuous speech: It allows users to speak naturally while the computer
determines the content in parallel.
4) Spontaneous speech: Speech that sounds natural and is not rehearsed.
Types of vocabulary
The vocabulary size of a speech recognition system affects its processing
requirements, accuracy, and complexity. In a voice recognition
(speech-to-text) system, vocabularies can be classified as follows:
1) Small vocabulary: single letters.
2) Medium vocabulary: two- or three-letter words.
3) Large vocabulary: words with more letters.
Survey of research papers
Kuldip K. Paliwal et al., in the year 2004, discussed that, despite their
popularity as front-end parameters in speech recognition, the cepstral
coefficients obtained from linear prediction analysis are sensitive to noise.
They discussed the use of spectral subband centroids for robust speech
recognition, arguing that recognition performance comparable to MFCC can be
achieved if the centroids are selected properly. A procedure was proposed to
construct a dynamic centroid feature vector that essentially captures
transitional spectral information.
Ibrahim Patel et al., in the year 2010, discussed an approach that uses
frequency spectral information with the Mel frequency to improve speech
recognition, based on an HMM recognition approach. It combines frequency
spectral information with the conventional Mel-spectrum-based approach to
speech recognition. The Mel frequency approach utilizes the frequency
observation of speech within a given resolution, which results in overlapping
resolution features and thus limits recognition. In the HMM-based speech
recognition system, resolution decomposition is used with a
frequency-separating mapping approach. The result of the study is an
improvement in the quality metrics of speech recognition with respect to
computational time and learning accuracy.
Kavita Sharma and Prateek Hakar, in the year 2012, presented speech
recognition in terms of broader solutions. It refers to technology that can
recognize speech without being targeted at a single speaker. Variability in
speech patterns is the main problem in speech recognition. Speaker
characteristics, which include accent, noise, and co-articulation, are the
most challenging sources of variation in speech. In a speech recognition
system, the function of the basilar membrane is mimicked in the front end by
the filter bank. It is believed that a band subdivision closer to human
perception gives better recognition results. The filter constructed for
speech recognition is estimated from noisy and clean speech.
Puneet Kaur, Bhupender Singh and Neha Kapur, in the year 2012, discussed how
to use the Hidden Markov Model in the process of speech recognition. The
three essential steps to develop an ASR (Automatic Speech Recognition) system
are pre-processing, feature extraction, and recognition; finally, a hidden
Markov model is used to obtain the desired result. Researchers are
continuously trying to develop a perfect ASR system; there have already been
huge advancements in the field of digital signal processing, but at the same
time computer performance in this field is not so high in terms of speed of
response and matching accuracy. The three different techniques used by
researchers are the acoustic-phonetic approach, the pattern recognition
approach, and the knowledge-based approach [4].
Geeta Nijhawan, Poonam Pandit and Shivanker Dev Dhingra, in the year 2013,
discussed the techniques of dynamic time warping (DTW) and Mel-scale
frequency cepstral coefficients in isolated speech recognition. Different
features of the spoken words were extracted from the input speech. A sample
of 5 speakers was collected, each speaking 10 digits, and a database was
built on this basis. Features were then extracted using MFCC. DTW is used to
deal effectively with varying speaking speeds: it measures the similarity
between two sequences that vary in speed and time.
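The DTW alignment mentioned here is a short, standard algorithm. The sketch below is a textbook implementation for comparing two MFCC sequences, not the code from the cited paper.

```python
# Textbook dynamic time warping (DTW) between two feature sequences,
# e.g. MFCC frames of the same digit spoken at different speeds.
import numpy as np

def dtw_distance(a, b):
    """a: (n, d), b: (m, d) arrays of feature frames."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```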
Why this project?
Nearly 20% of the world's population suffers from various disabilities; many
people are blind or unable to use their hands effectively. They can share
information with others by operating a computer through voice input.
Our project is capable of recognizing speech, converting input audio into
text and text into audio.
This project can help solve many disability- and communication-related problems.
FUTURE SCOPE
Greater use will be made of "intelligent systems" which will attempt to guess
what the speaker intended to say, rather than what was actually said, as people
often misspeak and make unintentional mistakes.
Homonyms:
These are words that are spelled differently and have different meanings but
sound the same, for example "there" and "their", "be" and "bee". It is a
challenge for a machine to distinguish between such phrases that sound alike.
Speeches:
Noise factor:
The program requires hearing the words uttered by a human distinctly and
clearly. Any extra sound can create interference, so you first need to place
the system away from noisy environments and then speak clearly; otherwise the
machine will get confused and mix up the words.
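If the SpeechRecognition library sketched earlier is used, it offers a built-in calibration step for exactly this problem: sampling the background for a moment so the recognizer can set its energy threshold before listening. This is an illustrative mitigation, not a complete noise-robustness solution.

```python
# Ambient-noise calibration sketch with the SpeechRecognition library.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source, duration=1)  # calibrate threshold
    audio = recognizer.listen(source)                        # then record speech
```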
APPLICATIONS
In Car Systems
Health Care
Military
There are TTA tools available for nearly every digital device.
With a click of a button or the touch of a finger, TTA can take words on a
computer or other digital device and convert them into audio. TTA is very
helpful for kids and adults who struggle with reading. But it can also help with
writing and editing, and even with focusing.
TTA works with nearly every personal digital device, including computers,
smartphones, and tablets. All kinds of text files can be read aloud, including
Word and Pages documents. Even online web pages can be read aloud.
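As an illustration, an offline text-to-audio call on a personal computer can be as short as the sketch below, here using the pyttsx3 package; the package and the rate setting are assumed choices for demonstration.

```python
# Minimal offline text-to-audio sketch using pyttsx3 (pip install pyttsx3).
import pyttsx3

engine = pyttsx3.init()          # pick the platform's default TTS engine
engine.setProperty("rate", 150)  # speaking speed (words per minute)
engine.say("Audext converts text into audio.")
engine.runAndWait()              # block until playback finishes
```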
Benefits of AUDEXT
Speech recognition software can produce documents in less than half the time
it takes to type them.
Word-of-mouth marketing.
Accessibility is relevant.
Conclusion
Text-to-audio and audio-to-text systems are a rapidly growing aspect of
computer technology and play an increasingly important role in the way we
interact with systems and interfaces across a variety of platforms. We have
identified the various operations and processes involved in text-to-speech
synthesis. We have also developed a very simple and attractive graphical user
interface which allows the user to type his/her text into the text field
provided in the application. Our system interfaces with a text-to-speech
engine developed for American English. In the future, we plan to make efforts
to create engines for localized Nigerian languages so as to make
text-to-speech technology accessible to a wider range of Nigerians; this
already exists for some native languages, e.g. Swahili, Konkani, the
Vietnamese synthesis system, and the Telugu language. Another area of further
work is the implementation of a text-to-speech system on other platforms,
such as telephony systems, ATMs, video games, and any other platforms where
text-to-speech technology would be an added advantage and increase
functionality.
Project Screenshots
Home Page
Contact
Social Page
Audio to Text Converter
Text to Audio Converter
References
1. Lemmetty, S., 1999. Review of Speech Synthesis Technology. Master's Dissertation,
Helsinki University of Technology.
2. Dutoit, T., 1993. High quality text-to-speech synthesis of the French language. Doctoral
dissertation, Faculte Polytechnique de Mons.
3. Suendermann, D., Höge, H., and Black, A., 2010. Challenges in Speech Synthesis. In:
Chen, F., Jokinen, K. (eds.), Speech Technology, Springer Science + Business Media LLC.
4. Allen, J., Hunnicutt, M. S., Klatt D., 1987. From Text to Speech: The MITalk system.
Cambridge University Press.
5. Rubin, P., Baer, T., and Mermelstein, P., 1981. An articulatory synthesizer for perceptual
research. Journal of the Acoustical Society of America 70: 321–328.
6. van Santen, J.P.H., Sproat, R. W., Olive, J.P., and Hirschberg, J., 1997. Progress in Speech
Synthesis. Springer.
7. van Santen, J.P.H., 1994. Assignment of segmental duration in text-to-speech synthesis.
Computer Speech & Language, Volume 8, Issue 2, Pages 95–128
8. Wasala, A., Weerasinghe, R., and Gamage, K., 2006. Sinhala Grapheme-to-Phoneme
Conversion and Rules for Schwa Epenthesis. Proceedings of the COLING/ACL 2006 Main
Conference Poster Sessions, Sydney, Australia, pp. 890-897.