TECHNICAL SEMINAR DOCUMENTATION ON VOICE MORPHING
MAREDUGONDA SONIA
18UD1A0537
BACHELOR OF TECHNOLOGY
COMPUTER SCIENCE & ENGINEERING
TRINITY COLLEGE OF ENGINEERING AND TECHNOLOGY
(Approved by AICTE, New Delhi, Affiliated to JNTU, Hyderabad)
PEDDAPALLI, KARIMNAGAR-505172
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
2018-2022
----------------------------------------------------------------------------------------------------------
CERTIFICATE
This is to certify that the seminar documentation entitled “VOICE MORPHING”
is submitted by MAREDUGONDA SONIA bearing HT. No (18UD1A0537) in
IVth B.Tech (CSE) I Semester.
Head of the Department
ABSTRACT
Voice morphing means the transition of one speech signal into another. Like image
morphing, speech morphing aims to preserve the shared characteristics of the starting and final
signals while generating a smooth transition between them. In image morphing, the in-between
images all show one face smoothly changing its shape and texture until it turns into the target
face. It is this feature that a speech morph should possess: one speech signal should smoothly
change into another, keeping the shared characteristics of the starting and ending signals while
smoothly changing the other properties.
Voice morphing is a technique for modifying a (source) speaker’s speech to sound as if it
were spoken by a different (target) speaker. In simpler terms, it is the ability to change the speech
of one speaker into that of another speaker. The technology was developed at the Los Alamos
National Laboratory in New Mexico, USA, by George Papcun. Applications for voice morphing
range from recreational ones to security ones.
We need to develop a time-stretching algorithm so that we can implement pitch
shifting. We obtain the residue of the source signal and stretch it according to the value of a
constant, alpha. The constant indicates the position of the morphed signal between the
source and target signals.
We break the residue signal into small windows and introduce a fade-in and a fade-out for
each block. We then recombine the blocks to form the pitch-shifted signal. Based on alpha, we can
time-stretch the residue according to our requirements. We should resample the pitch-shifted
signal so that it is played at a faster rate. If we inverse filter the resampled pitch-shifted residue,
we can effect the morph. In this way an accurate copy of a person’s voice can be made, so that
a speaker can say anything they wish in the voice of someone else.
INDEX
S.NO TITLE
ABSTRACT
1. Introduction
2. An Introspection of the morphing process
3. Morphing Process: A comprehensive analysis
3.1 Acoustics of speech production
3.2 Preprocessing
3.3 Signal Acquisition
3.4 Windowing
4. Morphing
4.1 Matching and Warping: Background theory
4.2 Dynamic Time Warping
5. Morphing Stage
5.1 Combination of the envelope information
5.2 Combination of the pitch information residual
5.3 Combination of the pitch peak information
6. Block Diagram
7. Application
8. Advantages
9. Disadvantages
10. Future Scope
11. Conclusion
12. References
1. INTRODUCTION
Voice morphing means the transition of one speech signal into another. Like image
morphing, speech morphing aims to preserve the shared characteristics of the starting and final
signals while generating a smooth transition between them. In image morphing, the in-between
images all show one face smoothly changing its shape and texture until it turns into the target
face. It is this feature that a speech morph should possess: one speech signal should smoothly
change into another, keeping the shared characteristics of the starting and ending signals while
smoothly changing the other properties.
The major properties of concern in a speech signal are its pitch and
envelope information. These two reside in a convolved form in a speech signal, so an
efficient method for extracting each of them is necessary. We have adopted a straightforward
approach, namely cepstral analysis, for this purpose. Pitch and formant information in each signal
is extracted using the cepstral approach. The processing needed to obtain the morphed speech
signal includes cross-fading of the envelope information, Dynamic Time Warping to
match the major signal features (pitch), and signal re-estimation to convert the morphed speech
signal back into an acoustic waveform.
In summary, voice morphing is the smooth transition of one speech signal into another that keeps
the shared characteristics of the starting and ending signals. Pitch and envelope information are the
two major properties of a speech signal, and cepstral analysis is used to extract them.
2. AN INTROSPECTION OF THE MORPHING PROCESS
Speech morphing can be achieved by transforming the signal’s representation from the
acoustic waveform obtained by sampling the analog signal, with which many people are
familiar, to another representation. To prepare the signal for the transformation, it is split into
a number of ‘frames’, that is, sections of the waveform. The transformation is then applied to each
frame of the signal. This provides another way of viewing the signal information. The new
representation (said to be in the frequency domain) describes the average energy present in each
frequency band.
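The framing and frequency-domain conversion described above can be illustrated with a minimal sketch. The function names, the 8000 Hz sample rate, the 64-sample frame length and the 1000 Hz test tone are assumptions chosen for illustration, and a naive DFT is used only to keep the example self-contained.

```python
import cmath
import math

def split_into_frames(signal, frame_len, hop):
    """Return overlapping frames of frame_len samples, spaced hop samples apart."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def dft_magnitude(frame):
    """Naive DFT of one frame; returns the magnitude of bins 0..N/2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags

SAMPLE_RATE = 8000  # Hz (assumed)
FRAME_LEN = 64      # samples per frame (assumed)

# Test signal: a 1000 Hz tone, 256 samples long.
signal = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]

frames = split_into_frames(signal, FRAME_LEN, hop=32)
spectra = [dft_magnitude(f) for f in frames]

# The strongest bin of the first frame, converted back to a frequency in Hz.
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
```

Each entry of `spectra` describes the energy of one frame per frequency bin, and the strongest bin of the first frame maps back to the frequency of the test tone.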
However, after the morphing has been performed, the legacy of the earlier analysis
becomes apparent. The conversion of the sound to a representation in which the pitch and
spectral envelope can be separated loses some information. Therefore, this information has to
be re-estimated for the morphed sound. This process yields an acoustic waveform, which can
then be stored or listened to.
Fig: Schematic block diagram of the speech morphing process
3. MORPHING PROCESS: A COMPREHENSIVE ANALYSIS
The algorithm to be used is shown in the simplified block diagram given below. The algorithm
contains a number of fundamental signal processing methods, including sampling, the discrete
Fourier transform and its inverse, and cepstral analysis. However, the main processes can be
categorized as follows.
I. Preprocessing or representation conversion: This involves processes such as signal acquisition
in discrete form and windowing.
II. Cepstral analysis or pitch and envelope analysis: This process extracts the pitch and
formant information in the speech signal.
III. Morphing, which includes warping and interpolation.
IV. Signal re-estimation.
Fig: Block diagram of the simplified speech morphing algorithm
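As an illustration of the cepstral analysis step (II), the sketch below computes the real cepstrum of one synthetic frame and splits it ("lifters" it) into an envelope part and a pitch part. The naive DFT, the magnitude floor, the 8000 Hz sample rate, the lifter cutoff and the synthetic pulse train are all assumptions made for this example. Because the frame is exactly periodic, most frequency bins are empty and are set to the floor value, which a real speech frame would not require to the same extent.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform, or its inverse."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * i / n)
                  for i, v in enumerate(values))
        result.append(acc / n if inverse else acc)
    return result

def real_cepstrum(frame, floor=1e-6):
    """Inverse DFT of the log magnitude spectrum of one frame."""
    # The floor (an assumption) avoids log(0) in empty frequency bins.
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    return [c.real for c in dft(log_mag, inverse=True)]

SAMPLE_RATE = 8000  # Hz (assumed)
PERIOD = 32         # pitch period in samples, i.e. 250 Hz
CUTOFF = 20         # lifter cutoff in samples (assumed)

# Synthetic voiced frame: a decaying pulse repeated every PERIOD samples.
frame = [0.8 ** (n % PERIOD) for n in range(256)]
cepstrum = real_cepstrum(frame)

# Liftering: low quefrencies hold the envelope, higher ones hold the pitch.
envelope_part = cepstrum[:CUTOFF]
pitch_part = cepstrum[CUTOFF:len(cepstrum) // 2]

# Search for the pitch peak over quefrencies 20..59 (400 Hz down to about 136 Hz).
pitch_period = max(range(CUTOFF, 60), key=lambda q: cepstrum[q])
f0 = SAMPLE_RATE / pitch_period
```

The position of the largest value in the pitch part gives the pitch period in samples, and the low-quefrency part can be transformed back to obtain a smooth spectral envelope.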
3.1 Acoustics of speech production
Speech production can be viewed as a filtering operation in which a sound source excites
a vocal tract filter. The source may be periodic, resulting in voiced speech, or noisy and aperiodic,
causing unvoiced speech. As a periodic signal, voiced speech has a spectrum consisting of
harmonics of the fundamental frequency of the vocal cord vibration. This frequency, often
abbreviated as F0, is the physical aspect of the speech signal corresponding to the perceived
pitch. Thus pitch refers to the fundamental frequency of the vocal cord vibrations or the resulting
periodicity in the speech signal. F0 can be determined either from the periodicity in the time
domain or from the regularly spaced harmonics in the frequency domain.
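The time-domain route to F0 can be sketched with a simple autocorrelation search. This is not the method adopted in this report (which uses cepstral analysis); it is a minimal illustration, and the function name, the 60 to 400 Hz search range and the synthetic two-harmonic test signal are assumptions.

```python
import math

def autocorrelation_pitch(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 as the lag with the largest autocorrelation value."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)

    def autocorr(lag):
        return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

    best_lag = max(range(lag_min, lag_max + 1), key=autocorr)
    return sample_rate / best_lag

SAMPLE_RATE = 8000  # Hz (assumed)

# Synthetic voiced frame: a 200 Hz fundamental plus its second harmonic.
frame = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE)
         + 0.5 * math.sin(2 * math.pi * 400 * t / SAMPLE_RATE)
         for t in range(400)]

f0 = autocorrelation_pitch(frame, SAMPLE_RATE)
```

The frame must span at least two pitch periods at the lowest frequency searched, otherwise the autocorrelation at the true lag has too few terms to stand out.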
3.2 Preprocessing
This section introduces the major concepts associated with processing a speech signal
and transforming it to the new required representation to effect the morph. This process takes
place for each of the signals involved in the morph.
3.3 Signal Acquisition
Before any processing can begin, the sound signal that is created by some real-world
process must be captured by the computer by some method. This is called sampling. A fundamental
aspect of digital signal processing (in this case of sound) is that it operates on sequences of samples.
When a natural process, such as a musical instrument, produces sound, the signal produced is
analog (continuous time) because it is defined along a continuum of times. A discrete-time signal
is represented by a sequence of numbers: the signal is only defined at discrete times. A digital
signal is a special instance of a discrete-time signal in which both time and amplitude are discrete. Each
discrete representation of the signal is termed a sample.
Fig: Signal Acquisition
The input speech signals are captured using a microphone and a codec. The analog speech signal is
converted into discrete form. This completes the signal acquisition phase.
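The conversion from a continuous-time signal to a digital one can be sketched as follows. The function name, the 8000 Hz sample rate, the 16-bit depth and the 440 Hz test tone are assumptions for illustration; in practice the codec hardware performs this step.

```python
import math

def sample_and_quantize(analog, duration_s, sample_rate, bits):
    """Sample analog(t) at sample_rate and round each value to a signed integer.

    analog is assumed to return amplitudes in the range [-1.0, 1.0].
    """
    levels = 2 ** (bits - 1) - 1          # largest positive integer code
    count = round(duration_s * sample_rate)
    return [round(analog(i / sample_rate) * levels) for i in range(count)]

def tone(t):
    """A 440 Hz sine wave standing in for the analog input."""
    return math.sin(2 * math.pi * 440 * t)

samples = sample_and_quantize(tone, duration_s=0.01, sample_rate=8000, bits=16)
```

Time is made discrete by evaluating the signal only at multiples of the sampling interval, and amplitude is made discrete by rounding to the nearest integer level.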
3.4 Windowing
A DFT (Discrete Fourier Transform) can only deal with a finite amount of information.
Therefore, a long signal must be split up into several segments. These are called frames.
Speech signals are constantly changing, so the aim is to make the frame short
enough for the segment to be almost stationary and yet long enough to resolve consecutive pitch
harmonics. Therefore, the length of such frames tends to be in the region of 25 to 75
milliseconds. There are several possible windows. One choice is:
The Hamming Window
\[ w(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N}\right), & 0 \le n \le N \\ 0, & \text{otherwise} \end{cases} \]
Fig: Windowing
The frequency-domain spectrum of the Hamming window is much smoother than that
of the rectangular window, and the Hamming window is therefore commonly used in spectral
analysis. The windowing function splits the signal into time-weighted frames.
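A minimal sketch of applying a Hamming window to one frame is given below. It uses the usual Hamming coefficients 0.54 and 0.46 and defines the window on 0 <= n <= N, so a frame of N + 1 samples is weighted; the function names are illustrative.

```python
import math

def hamming(length):
    """Hamming window of the given length, defined on n = 0..N with N = length - 1."""
    n_max = length - 1
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / n_max)
            for n in range(length)]

def apply_window(frame):
    """Weight each sample of the frame by the corresponding window value."""
    return [x * w for x, w in zip(frame, hamming(len(frame)))]

window = hamming(9)
weighted = apply_window([1.0] * 9)
```

The window tapers towards 0.08 at both ends and reaches 1.0 in the middle, which reduces the discontinuities that a rectangular cut would introduce at the frame edges.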
4. MORPHING
4.1 Matching and Warping: Background theory
Both signals will have a number of time-varying properties. To create an effective
morph, it is necessary to match one or more of these properties of each signal to those of the
other signal in some way. The property of concern is the pitch of the signal (although other
properties such as the amplitude could be used), and it will have several features. It is almost
certain that matching features do not occur at the same point in each signal. Therefore, each
feature must be moved to some point in between its position in the first sound and in the second.
In other words, to smoothly morph the pitch information, the pitch present in each signal needs
to be matched and then the amplitude at each frequency cross-faded. To perform the pitch
matching, a pitch contour for the entire signal is required. This is obtained by using the pitch peak
location in each cepstral pitch slice.
Consider the simple case of two signals, each with two features occurring in
different positions as shown in the figure below.
Fig: The match path between two signals with differently located features
The match path shows the amount of movement (or warping) required in order to align
corresponding features in time. Such a match path is obtained by Dynamic Time Warping (DTW).
4.2 Dynamic Time Warping
Speaker recognition and speech recognition are two important applications of speech
processing. These applications are essentially pattern recognition problems, which form a large field.
Automatic Speech Recognition (ASR) systems employ time normalization. This is the process by
which time-varying features within a word are brought into line.
Such a time-warping algorithm is usually implemented by dynamic programming and is known as
Dynamic Time Warping (DTW). Its result can be used directly as a distance measure. Here, DTW is
used to find the best match between the features of the two sounds, in this case their pitch. DTW
enables a match path to be created. This shows how each element in one signal corresponds to
each element in the second signal.
To understand DTW, two concepts need to be dealt with:
Features: The information in each signal must be represented in some manner.
Distances: Some form of metric must be used to obtain a match path. There are two types:
1. Local: A computational difference between a feature of one signal and a feature of the other.
2. Global: The overall computational difference between an entire signal and another signal of
possibly different length.
In this use of DTW, a path between two pitch contours is required. Each feature vector
will be a single value. In other uses of DTW, however, such feature vectors could be large arrays
of values. Since the feature vectors could possibly have multiple elements, a means of calculating
the local distance is required. The distance measure between two feature vectors is calculated
using the Euclidean distance metric.
The global distance is the overall difference between the two signals. Audio is a
time-dependent process. To produce a global distance measure, time alignment must be performed:
the matching of similar features and the stretching and compressing, in time, of others. Instead
of considering every possible match path, which would be very inefficient, a number of constraints
are imposed upon the matching process.
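A minimal sketch of DTW between two pitch contours follows. It assumes one pitch value per frame, uses the absolute difference as the Euclidean local distance for single-valued features, and imposes the common constraints that the path starts at the first frames, ends at the last frames, and only moves forward by one frame in either or both signals. The function name and the example contours are illustrative.

```python
def dtw(a, b):
    """Return (global_distance, match_path) between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = smallest accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # Euclidean distance in 1-D
            cost[i][j] = local + min(cost[i - 1][j],      # a advances
                                     cost[i][j - 1],      # b advances
                                     cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the match path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path

# Two pitch contours (Hz per frame) with the same features at different times.
contour_1 = [100, 100, 120, 140, 140]
contour_2 = [100, 120, 120, 140]
distance, match_path = dtw(contour_1, contour_2)
```

Each pair `(i, j)` in `match_path` states that frame `i` of the first contour corresponds to frame `j` of the second, and `distance` is the global distance along that path.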
5. MORPHING STAGE
The overall aim in this section is to make a smooth transition from signal 1 to signal 2.
This is partially accomplished by the 2D array of the match path provided by the DTW. At this
stage, it must be decided exactly what form the morph will take. The implementation chosen was
to perform the morph over the duration of the longer signal. In other words, the final morphed
speech signal has the duration of the longer signal. In order to accomplish this, the 2D
array is interpolated to provide the desired duration.
At the beginning of the morph, the pitch peak will take on more of the characteristics of the
signal 1 pitch peak (such as its peak value); towards the end of the morph, the peak will bear more
resemblance to that of signal 2. The variable λ is used to control the balance between signal 1
and signal 2. At the beginning of the morph, λ has the value 0, and upon completion, λ has the value 1.
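The interpolation of the match path and the morph control variable (written λ here, `lam` in code) can be sketched as below. Linear interpolation of the path and a linear ramp for the control variable are assumptions; the text does not specify either, and the function names and example path are illustrative.

```python
def resample_path(path, n_frames):
    """Linearly interpolate a match path to exactly n_frames (i, j) pairs."""
    last = len(path) - 1
    out = []
    for k in range(n_frames):
        pos = k * last / (n_frames - 1) if n_frames > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        i = (1 - frac) * path[lo][0] + frac * path[hi][0]
        j = (1 - frac) * path[lo][1] + frac * path[hi][1]
        out.append((i, j))
    return out

def morph_weights(n_frames):
    """Control variable per output frame: 0 at the start, 1 at the end."""
    return [k / (n_frames - 1) for k in range(n_frames)]

# A match path between a 5-frame signal and a 4-frame signal (example data).
path = [(0, 0), (1, 0), (2, 1), (2, 2), (3, 3), (4, 3)]
n_frames = 5  # duration of the longer signal, in frames
resampled = resample_path(path, n_frames)
weights = morph_weights(n_frames)
```

Each output frame then has a (possibly fractional) position in both signals and a weight that states how far the morph has progressed towards signal 2.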
To illustrate the morph process, these two cepstral slices shall be used.
There are three stages:
1. Combination of the envelope information.
2. Combination of the pitch information residual – the pitch information excluding the
pitch peak.
3. Combination of the pitch peak information.
Fig: A second sample cepstral slice with the pitch p
5.1 Combination of the envelope information
We can say that the best morphs are obtained when the envelope information is merely
cross-faded, as opposed to employing any pre-warping of features, and so this approach is
adopted here. To cross-fade any information in the cepstral domain, care must be taken. Due
to the properties of the logarithms employed in the cepstral analysis stage, multiplication is
transformed into addition. Therefore, if a cross-fade between the two envelopes were
attempted in the cepstral domain, multiplication would in fact take place. Consequently, each
envelope must be transformed back into the frequency domain (involving an inverse logarithm)
before the cross-fade is performed. Once the envelopes have been successfully cross-faded
according to the weighting determined by the morph control variable λ, the morphed envelope is
once again transformed back into the cepstral domain. This new cepstral slice forms the basis of
the completed morph slice.
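The reason for leaving the log domain before cross-fading can be shown with a small numeric sketch. It assumes the envelopes are already available as log-magnitude spectra (that is, after transforming the cepstral envelope back to the frequency domain but before the inverse logarithm); the function names and the two-bin envelopes are illustrative.

```python
import math

def crossfade_linear(log_env_1, log_env_2, lam):
    """Undo the logarithm, cross-fade the magnitudes, then take the log again."""
    return [math.log((1 - lam) * math.exp(a) + lam * math.exp(b))
            for a, b in zip(log_env_1, log_env_2)]

def crossfade_log(log_env_1, log_env_2, lam):
    """Cross-fade the log values directly (this multiplies the magnitudes)."""
    return [(1 - lam) * a + lam * b for a, b in zip(log_env_1, log_env_2)]

# Two envelopes with magnitudes [1, 4] and [4, 1], stored as logarithms.
env_1 = [math.log(1.0), math.log(4.0)]
env_2 = [math.log(4.0), math.log(1.0)]

halfway_linear = [math.exp(v) for v in crossfade_linear(env_1, env_2, 0.5)]
halfway_log = [math.exp(v) for v in crossfade_log(env_1, env_2, 0.5)]
```

Halfway through the morph, the linear-domain cross-fade gives the arithmetic mean of the two magnitudes (2.5), whereas fading the logarithms gives their geometric mean (2.0), which is the multiplication effect described above.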
Fig: Cross fading of the formants
5.2 Combination of the pitch information residual
The pitch information residual is the pitch information section of the cepstral slice with
the pitch peak removed by liftering. To produce the morphed residual, it is combined in a
similar way to the envelope information. No further matching is performed. It is simply
transformed back into the frequency domain and cross-faded with respect to the morph control
variable λ. Once the cross-fade has been performed, it is again transformed into the cepstral
domain. The information is now combined with the new morph cepstral slice (currently containing
envelope information). The only remaining part to be morphed is the pitch peak area.
Fig: Cross fading of the Pitch information
5.3 Combination of the pitch peak information
As stated above, to produce a satisfying morph, it must have just one pitch. This means
that the morph slice must have a pitch peak which has characteristics of both signal 1 and signal
2. Therefore, an artificial peak needs to be generated to satisfy this requirement. The positions of
the signal 1 and signal 2 pitch peaks are stored in an array (created during the pre-processing),
which means that the desired pitch peak location can easily be calculated.
To manufacture the peak, the following process is performed:
I. Each pitch peak area is liftered from its respective slice. Although the alignment of the
pitch peaks will not match with respect to the cepstral slices, the pitch peak areas are
liftered in such a way as to align the peaks with respect to the liftered area.
II. The two liftered cepstral slices are then transformed back into the frequency domain,
where they can be cross-faded with respect to the morph control variable λ. The cross-fade
is then transformed back into the cepstral domain.
III. The morphed pitch peak area is now placed at the appropriate point in the morph
cepstral slice to complete the process.
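The peak manufacturing steps can be sketched as follows. This is a simplified illustration: it computes the target peak position by linear interpolation between the two stored peak positions (an assumption), and it cross-fades the aligned peak areas directly in the cepstral domain instead of passing through the frequency domain as step II describes. The function name, the width of the liftered area and the example slices are hypothetical, and the peaks are assumed to lie far enough from the slice edges.

```python
def morph_pitch_peak(cep_1, peak_1, cep_2, peak_2, lam, half_width=2):
    """Build a slice containing only the morphed pitch peak area.

    Simplification: the aligned areas are cross-faded directly here,
    not in the frequency domain.
    """
    target = round((1 - lam) * peak_1 + lam * peak_2)
    morphed = [0.0] * len(cep_1)
    for offset in range(-half_width, half_width + 1):
        a = cep_1[peak_1 + offset]   # area liftered around the signal 1 peak
        b = cep_2[peak_2 + offset]   # area liftered around the signal 2 peak
        morphed[target + offset] = (1 - lam) * a + lam * b
    return target, morphed

# Example cepstral slices with pitch peaks at quefrency 20 and 30.
cep_1 = [0.0] * 64
cep_1[19:22] = [0.5, 1.0, 0.5]
cep_2 = [0.0] * 64
cep_2[29:32] = [0.3, 0.6, 0.3]

target, peak_slice = morph_pitch_peak(cep_1, 20, cep_2, 30, lam=0.5)
```

Halfway through the morph, the single manufactured peak sits midway between the two original peak positions and its height lies between the two original heights.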
The morphing process is now complete. The final series of morphed cepstral slices is
transformed back into the frequency domain. All that remains is to re-estimate the
waveform.
6. BLOCK DIAGRAM
The whole morphing process is summarized using the detailed block diagram as shown below.
7. APPLICATION
• Voice morphing is a powerful battlefield weapon which can be used to provide fake orders
to the enemy’s troops, appearing to come from their own commanders.
• In cartoons, where one person can provide the voices of a number of characters.
• It has possibilities in military psychological warfare and subversion, particularly in
conjunction with the use of recorded telephone conversations as evidence in courts of
law.
• An agency can use voice morphing to provide a fake confession or incriminating evidence
appearing to be spoken by a suspect, which in reality is fake.
• Text-to-speech customization systems, where speech can be produced with a desired voice
or email may be read out in the sender’s voice.
• In replacing or enhancing the skills involved in producing sound tracks for animated
characters, dubbing or voice impersonation, which may be used in the entertainment
industry.
• For disguising the voice of a speaker, especially in internet chat rooms.
• In public address systems, the sound can be made to resemble that of a popular public speaker.
8. ADVANTAGES
• Allows a speech model to be duplicated and an exact copy of a person’s voice to be made.
• Powerful combat zone weapon.
9. DISADVANTAGES
• It can be misused to pull out useful information.
• It hides the actual identity of the user.
10. FUTURE SCOPE
There are several areas in which future work should be carried out to improve the
technique described here and extend the field of speech morphing in general. The time required
to generate a morph is dominated by the signal re-estimation process. Even a small number of
iterations takes a significant amount of time, even to re-estimate signals of approximately one
second duration. Although an inevitable loss of quality due to manipulation occurs in speech
morphing, so that fewer iterations are required, an improved re-estimation algorithm is still needed.
Several of the processes, such as the matching and signal re-estimation, are very unrefined
and inefficient methods but do produce satisfactory morphs. Concentrating future work on the
issues described above, and on extensions to the speech morphing principle, should produce
systems which create extremely convincing and satisfying speech morphs.
The speech morphing concept can be extended to include audio sounds in general. This
area offers many possible applications, including sound synthesis. One approach is to digitally model
the sound’s physical source and provide several parameters to produce a synthetic note of the desired
pitch. Another is to take two notes which bound the desired note and use the principles used in
speech morphing to manufacture a note which contains the shared characteristics of the
bounding notes but whose other properties have been altered to form a new note. The use of
pitch manipulation within the algorithm also has an interesting potential use. In the interests of
security, it is sometimes necessary for people to disguise the identity of their voice. An interesting
way of doing this is to alter the pitch of the sound in real time using sophisticated methods.
11. CONCLUSION
The approach we have adopted separates each sound into two forms: spectral envelope
information, and pitch and voicing information. These can then be independently modified. The
pitch peaks are obtained from the pitch spectrograms to
create a pitch contour for each sound. Dynamic Time Warping of these contours aligns the sounds
with respect to their pitches. At each corresponding frame, the pitch, voicing, and envelope
information are separately morphed to produce a final morphed frame. These frames are then
converted back into a time-domain waveform using the signal re-estimation algorithm.
• The approach separates the sounds into two forms: spectral envelope information and
pitch information.
• These can then be independently modified.
• Dynamic Time Warping of the pitch contours aligns the sounds with respect to their
pitches.
• At each corresponding frame, the pitch and envelope information are separately
morphed to produce a final morphed frame.
• These frames are then converted back into a time-domain waveform using the signal
re-estimation algorithm.
12. REFERENCES
• Valbret H., “Voice transformation using PSOLA technique”, Speech Communication,
Vol. 11, No. 2-3, 1992.
• Arslan L., “Speaker transformation algorithm using segmental codebooks”, Speech
Communication, Vol. 28, 1999.