Synopsis
Speech is one of the oldest and most natural means of information exchange between
humans. Over the years, attempts have been made to develop vocally interactive
computers that realise voice/speech synthesis. Such an interface would clearly yield
great benefits: the computer could synthesise text and read it aloud. Text-to-Speech
synthesis is a technology that converts written text from a descriptive form into
spoken language (here, English) that is easily understandable by the end user. The
system runs on the Python platform, and the methodology used was the Object-Oriented
Analysis and Design Methodology, while an Expert System was incorporated for the
internal operations of the program. The design is geared towards providing a one-way
communication interface whereby the computer communicates with the user by reading
out textual documents, for the purpose of quick assimilation and reading development.
Introduction
Continuous speech is a set of complicated audio signals, which makes producing it artificially
difficult. Speech signals are usually classified as voiced or unvoiced, though in some cases they
fall somewhere between the two. Voiced sounds consist of a fundamental frequency (F0) and its
harmonic components produced by the vocal cords (vocal folds). The vocal tract modifies this excitation
signal, producing formant (pole) and sometimes anti-formant (zero) frequencies (Abedjieva et al.,
1993). Each formant frequency also has an amplitude and a bandwidth, and it may sometimes be
difficult to determine these parameters correctly. The fundamental frequency and the formant
frequencies are probably the most important concepts in speech synthesis, and in speech processing
in general. In purely unvoiced sounds there is no fundamental frequency in the excitation signal,
and therefore no harmonic structure either; the excitation can be modelled as white noise.
For unvoiced sounds, the airflow is forced through a constriction of the vocal tract, which can
occur at several places between the glottis and the mouth. Some sounds are produced with a complete
stoppage of airflow followed by a sudden release, producing an impulsive turbulent excitation often
followed by a more protracted turbulent excitation (Allen et al., 1987). Unvoiced sounds are also
usually quieter and less steady than voiced ones.
Speech signals of the three vowels (/a/, /i/, /u/) are presented in the time-frequency domain in
Figure 3. The fundamental frequency is about 100 Hz in all cases, and with the vowel /a/ the formant
frequencies F1, F2, and F3 are approximately 600 Hz, 1000 Hz, and 2500 Hz respectively. With the
vowel /i/ the first three formants are 200 Hz, 2300 Hz, and 3000 Hz, and with /u/ they are 300 Hz,
600 Hz, and 2300 Hz.
Figure 3: The time-frequency domain presentation of vowels /a/, /i/, and /u/.
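These formant values are enough to sketch a crude source-filter synthesiser of the vowel /a/. The following is only an illustration of the concepts above, not the method of the system described later: a 100 Hz impulse train (the voiced excitation) is fed through a cascade of two-pole resonators placed at the quoted /a/ formants. The bandwidths and gain are assumed values chosen for demonstration.

```python
# Crude formant synthesis of /a/: F0 = 100 Hz; F1-F3 = 600, 1000, 2500 Hz
# (values from the text). Bandwidths are illustrative assumptions.
import numpy as np
from scipy.signal import lfilter

fs = 16000                                  # sample rate (Hz)
f0 = 100.0                                  # fundamental frequency (Hz)
formants = [600.0, 1000.0, 2500.0]          # F1, F2, F3 for /a/
bandwidths = [90.0, 110.0, 170.0]           # assumed formant bandwidths (Hz)

# Voiced excitation: an impulse train with one impulse per pitch period.
n = fs // 2                                 # half a second of signal
excitation = np.zeros(n)
excitation[::int(fs / f0)] = 1.0

# Cascade of two-pole resonators, one per formant.
y = excitation
for f, bw in zip(formants, bandwidths):
    r = np.exp(-np.pi * bw / fs)            # pole radius from bandwidth
    theta = 2.0 * np.pi * f / fs            # pole angle from centre frequency
    a = [1.0, -2.0 * r * np.cos(theta), r * r]
    b = [1.0 - r]                           # rough gain normalisation
    y = lfilter(b, a, y)

y /= np.max(np.abs(y))                      # normalise to [-1, 1]
```

Replacing the impulse train with white noise (np.random.randn(n)) models the purely unvoiced case described above, in which the excitation has no harmonic structure.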
The algorithm of existing systems is shown below in Figure 4. It shows that the system offers the
user no avenue to annotate text to his or her specification; it simply speaks plain text.
Figure 4: Algorithm of the existing system (START → INPUT TEXT → ALLOCATE ENGINE AND RESOURCES → SPEAK PLAINTEXT → STOP).
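For concreteness, the whole of Figure 4 can be mirrored in a few lines of Python. The sketch below uses the pyttsx3 library as a stand-in engine; the library choice is an assumption, since the text does not name the engine used by existing systems.

```python
# A minimal sketch of the existing algorithm in Figure 4, with pyttsx3
# standing in as the speech engine (an assumption, not the author's engine).
import pyttsx3

text = input("INPUT TEXT: ")    # INPUT TEXT

engine = pyttsx3.init()         # ALLOCATE ENGINE AND RESOURCES
engine.say(text)                # SPEAK PLAINTEXT (no structuring, no tags)
engine.runAndWait()
engine.stop()                   # STOP
```

Note that the text reaches the engine untouched, which is exactly the limitation listed below.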
Careful study revealed the following inadequacies in already existing systems:
1. Structure analysis: punctuation and formatting do not indicate where paragraphs and other
structures start and end. For example, the final period in “P.D.P.” might be misinterpreted as the
end of a sentence.
2. Text pre-processing: the system simply speaks the text exactly as it is fed in, without any
pre-processing operation occurring.
3. Text-to-phoneme conversion: an existing synthesizer can pronounce tens of thousands or even
hundreds of thousands of words correctly only if the words are found in its pronunciation
dictionary; words missing from the dictionary are frequently mispronounced.
It is expected that the new system will reduce the problems encountered in the old system and
improve on it. Among other things, the system is expected to do the following:
1. The new system has a reasoning process.
2. The new system can do text structuring and annotation.
3. The new system's speech rate can be adjusted (see the sketch after this list).
4. The pitch of the voice can be adjusted.
5. The user can select between different voices, and can even combine or juxtapose them to create a dialogue between them.
6. It has a user-friendly interface, so that people with little computer knowledge can use it easily.
7. It must be compatible with all the vocal engines.
8. It complies with the SSML specification.
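A minimal sketch of requirements 3 to 5 follows, again assuming pyttsx3 as the engine. The 'rate', 'volume' and 'voice' properties are standard in that library; pitch control is driver-dependent and is therefore only noted in a comment.

```python
# Illustrative sketch of adjustable rate and voice selection (requirements
# 3-5), using pyttsx3 as an assumed stand-in engine.
import pyttsx3

engine = pyttsx3.init()

engine.setProperty('rate', 150)          # 3. adjust speech rate (words/min)
engine.setProperty('volume', 0.9)        # volume on a 0.0-1.0 scale

voices = engine.getProperty('voices')    # 5. enumerate the installed voices
if len(voices) > 1:
    engine.setProperty('voice', voices[1].id)   # switch speaker for dialogue

# 4. Pitch is driver-dependent: some back ends accept a pitch property,
# others require prosody mark-up (SSML/JSML) in the text itself.

engine.say("This line is spoken with the selected voice and rate.")
engine.runAndWait()
```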
Figure 6: Data flow diagram of the speech synthesis system, using Gane and Sarson symbols. (The diagram connects the User Interface, Control Structure/Rule Interpreter, Knowledge Base, Working Memory, and Output.)
User Interface (Source): This can be a Graphical User Interface (GUI) or a Command Line
Interface (CLI).
Knowledge Base (Rule set): the FreeTTS module/system/engine. This source of knowledge
includes domain-specific facts and heuristics useful for solving problems in the domain. FreeTTS is
an open-source speech synthesis system written entirely in the Java programming language. It is
based upon Flite, and is an implementation of Sun's Java Speech API. FreeTTS supports end-of-speech
markers.
Control Structures: the rule interpreter (inference engine) applies the information in the
knowledge base to the problem being solved.
Short-term memory: the working memory registers the current problem status and the history of the solution to date.
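To make the control structure concrete, here is a toy forward-chaining interpreter: the inference loop repeatedly matches knowledge-base rules against the working memory and asserts new facts until no rule fires. The rules themselves are invented for illustration and are not the FreeTTS rule set.

```python
# Toy rule interpreter: each rule is (set of conditions, conclusion).
knowledge_base = [
    ({"input_received"}, "engine_allocated"),
    ({"engine_allocated", "text_is_jsml"}, "apply_tags"),
    ({"engine_allocated", "text_is_plain"}, "speak_plaintext"),
]

working_memory = {"input_received", "text_is_plain"}   # current problem status

fired = True
while fired:                       # control structure: the inference loop
    fired = False
    for conditions, conclusion in knowledge_base:
        if conditions <= working_memory and conclusion not in working_memory:
            working_memory.add(conclusion)   # record the solution to date
            fired = True

print(working_memory)
# {'input_received', 'text_is_plain', 'engine_allocated', 'speak_plaintext'}
```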
(Diagram: phonological component → phonetic feature implementation rules → articulatory model.)
(Flowchart of the proposed system: INPUT STRING → if the input is plain text, structure the string
value and speak it; otherwise, if the input is annotated with tags, check whether the annotated
value is valid JSML; if it is, apply the tag information and speak the structured text; if it is
not, ignore the tag information and structure the rest of the string → DEALLOCATE ENGINE → STOP.)
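The branch logic of this flowchart can be sketched directly in Python. The tag test and the JSML validity test below are deliberate simplifications (any well-formed markup is accepted), not the validation performed by a real JSML parser.

```python
# Sketch of the flowchart's decision logic. Tag detection and JSML validation
# are simplified: any markup is treated as candidate JSML and checked only
# for well-formedness.
import re
import xml.etree.ElementTree as ET

def process(text: str) -> str:
    """Return the string that should be handed to the speech engine."""
    if "<" not in text:                  # IS INPUT PLAINTEXT?
        return text                      # structure the string value and speak
    try:                                 # IS ANNOTATED VALUE VALID JSML?
        ET.fromstring(text)              # well-formed markup: keep the tags
        return text                      # APPLY THE TAG INFORMATION
    except ET.ParseError:
        # IGNORE THE TAG INFORMATION AND STRUCTURE THE REST OF THE STRING
        return re.sub(r"<[^>]*>", "", text)
```

For example, process('<jsml>hello</jsml>') keeps the annotation intact, while a string with malformed tags is stripped back to plain text.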
A determining factor in the choice of programming language was the special annotation (JSML) to be
supported by the program. JSML (Java Speech Markup Language) is an XML-based specification used to
annotate spoken output to the preferred construct of the user. In addition, there was the need for
a language that supports third-party development of program libraries, for use in situations not
covered by the specification of the original platform.
Considering these factors, the best choice of programming language was Python. Other factors that
made Python suitable were its dual nature (i.e. implementing two methodologies with one language),
its ability to implement proper data hiding (encapsulation), its support for abstract and inner
class/object development, and its support for polymorphism, which is a key property of the program
in question.
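As an illustration of the last two properties, the sketch below (all names invented) encapsulates the text source behind methods and lets plain-text and JSML inputs be spoken through one polymorphic call site, much like the Playable types described in the component list that follows.

```python
# Illustrative sketch of encapsulation and polymorphism; class names are
# hypothetical, not the program's actual classes.
import html

class Playable:
    """Base class; the text source is hidden behind a method."""
    def __init__(self, source: str):
        self._source = source            # data hiding: accessed via methods

    def text_to_speak(self) -> str:      # overridden polymorphically below
        raise NotImplementedError

class PlainTextPlayable(Playable):
    def text_to_speak(self) -> str:
        return html.escape(self._source)     # any '<' is spoken literally

class JSMLPlayable(Playable):
    def text_to_speak(self) -> str:
        return self._source                  # tags passed through to the engine

# One call site serves every playable type: polymorphism in action.
for item in (PlainTextPlayable("2 < 3"), JSMLPlayable("<jsml>hello</jsml>")):
    print(item.text_to_speak())
```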
1. Menu Bar: provides selection among the many variables and the file chooser system.
2. Monitor: monitors the reasoning process by displaying the allocation and de-allocation state of the engine.
3. Voice System: shows the different voice options provided by the system.
4. Playable Session: maintains the timing of the speech being given as output, producing speech in
synchronism with the specified rate.
5. Playable Type: specifies the type of text to be spoken, whether a plain text file or an annotated JSML file.
6. Text-to-Speech Activator: plays the given text and produces the spoken output.
7. Player Model: a model of all the functioning parts and the knowledge base representation in the program.
8. Player Panel: shows the panel and content pane of the basic objects in the program, and
specifies where each object is placed in the system.
9. Synthesizer Loader: loads the synthesizer engine, allocating and de-allocating resources appropriately (a sketch follows this list).
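As a closing sketch, the loader's allocate/de-allocate cycle can be expressed as a context manager, again with pyttsx3 standing in for the synthesizer engine; the function name synthesizer() is invented for this illustration.

```python
# Hypothetical Synthesizer Loader: resources are allocated on entry and
# deallocated on exit, with pyttsx3 as the assumed engine.
import pyttsx3
from contextlib import contextmanager

@contextmanager
def synthesizer():
    engine = pyttsx3.init()      # allocate engine and resources
    try:
        yield engine
    finally:
        engine.stop()            # deallocate resources when the session ends

with synthesizer() as engine:
    engine.say("Engine allocated, text spoken, engine deallocated.")
    engine.runAndWait()
```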