1 BTP Report 07010245 Suraj S Sheth16april2011

Extraction of Pitch from Speech Signals
using Hilbert Huang Transform
Submitted for the fulfilment of the requirements for the award of
the degree of Bachelor of Technology
by
Suraj Satishkumar Sheth (ROLL No. : 07010245)
Supervised by
Dr. S. R. M. Prasanna
Department of Electronics & Communication
Engineering
Indian Institute of Technology, Guwahati
Year: July 2010 to May 2011
CERTIFICATE
It is certified that the work contained in the report entitled
“Extraction of Pitch from Speech Signals using Hilbert Huang
Transform” is a bona fide work of Suraj Satishkumar Sheth
(Roll No. 07010245), which has been carried out in the
Department of Electronics and Communication Engineering,
Indian Institute of Technology (IIT) Guwahati under my
supervision and this work has not been submitted elsewhere for a
degree.
Dr. S. R. M. PRASANNA
Associate Professor,
Department of Electronics and Communication Engineering,
Indian Institute of Technology Guwahati,
Guwahati – 781039, INDIA

April, 2011
GUWAHATI
Acknowledgements
It is a great privilege to express my deepest and most sincere gratitude to my supervisor,
Dr. S. R. M. Prasanna, for the valuable guidance he provided during the course of this project.
This work would not have been possible without his guidance. I would also like to thank the
Head of the Department and the other faculty members for their kind help in carrying out this work.
I am very grateful to the non-teaching staff and students of the department, who have helped me
from the very beginning of this work. I sincerely thank Mr. Gyandhar Pradhan and Mr. Govind for
their help.
Contents
1. Abstract
2. Introduction
3. My contributions
4. Empirical Mode Decomposition
5. Mode-Mixing
6. Ensemble Empirical Mode Decomposition
7. Neighbourhood Limited Empirical Mode Decomposition
8. Pitch Extraction using Empirical Mode Decomposition
9. Potential Applications
10. Results and Conclusions
11. Future Work
12. References
List of Figures
Figure 1. Speech Signal along with Maxima and Minima
Figure 2. Speech Signal along with Maximum, minimum and mean-envelope
Figure 3. A synthetic signal with two sinusoids and the corresponding IMFs
Figure 4. A synthetic signal having two sinusoids of different lengths, and the Intrinsic Mode
Functions obtained using Empirical Mode Decomposition
Figure 5. A speech signal and the corresponding Intrinsic Mode Functions depicting the
Mode-Mixing phenomenon, specifically in IMF4 and IMF5
Figure 6. The Speech signal and its Fourth and Fifth Intrinsic Mode Functions
Figure 7. Input Signal: Concatenation of two sinusoids
Figure 8. EMD: Single IMF contains two modes
Figure 9. EEMD: Highest Frequency IMF
Figure 10. EEMD: Lowest Frequency IMF
Figure 11. Input Signal: x=1:1000; y=[sin(0.1*x) sin(0.8*x)];
Figure 12. EMD: Single IMF contains 2 modes
Figure 13. NLEMD: Highest Frequency IMF
Figure 14. NLEMD: Lowest Frequency IMF
Figure 15. The speech signal and the corresponding IMFs
Figure 16. Filtered IMFs along with the speech signal
Figure 17. The speech signal and the corresponding IMFs determined by modified
Neighbourhood Limited Empirical Mode Decomposition
Figure 18. The speech signal and the corresponding Filtered IMFs obtained using EMD,
Neighbourhood Limited Criterion and Filtering
Figure 19. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Envelope of Filtered
IMF6 and the Threshold in red
Figure 20. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Envelope of Filtered
IMF6 and the Threshold in red, (d) Epochs determined by our novel HHT based method and (e)
Epochs determined by ZFF method
Figure 21. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Time-frequency
representation of IMF6, (d) Epochs determined by our novel HHT based method and (e) Epochs
determined by ZFF method
Figure 22. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Time-frequency
representation of IMF6, (d) Epochs determined by our novel HHT based method and (e) Epochs
determined by ZFF method
Tables
Table 1: Comparison of Frequency Representation techniques
1. Abstract
In this report, a novel method for instantaneous pitch extraction using the Hilbert Huang
Transform is proposed. Unlike traditional methods, it does not require truncating the data and
segmenting them into windows, nor does it assume that the signal is stationary or linear.
The instantaneous pitch is derived from the Intrinsic Mode Functions (IMFs), which in turn are
obtained from the speech signal through Empirical Mode Decomposition. We have explored
various methods to overcome the shortcomings of Empirical Mode Decomposition and proposed
several new variants of these methods that address the specific problem at hand. Robustness
is increased by using a novel variant of Neighbourhood Limited Empirical Mode Decomposition.
Also, the filtered IMFs, rather than the IMFs themselves, are used to determine the best
candidate for pitch.
The epochs located by the novel algorithm are compared with those obtained using the
Zero-Frequency Filter method. The accuracy is found to be 96.43% on a dataset containing about
5000 epochs determined by the Zero-Frequency Filter method.
The envelope of the modified IMF that is the best candidate for pitch is also shown to be useful
for applications such as Digit Recognition. The algorithm provides pitch with high time
resolution and high frequency resolution, and it is not significantly affected by windowing
effects because short-time analysis is not used. The basis functions are adaptive rather than
fixed a priori, which is important for this specific problem.
2. Introduction
The pitch of a speech signal plays an important role in many speech processing applications,
including speaker recognition, automatic speech recognition, speech enhancement, analysis and
modelling of speech prosody, and low-bit-rate speech coding. Although many methods exist for
pitch extraction, their reliability and accuracy remain limited. The instantaneous pitch usually
varies even within a single frame, so good time resolution is needed. Good frequency resolution
is needed as well, since the extracted pitch is used in a variety of applications.
Time-frequency representation is an important component of Signal Processing with many potential
applications. Examples include the STFT and the Wavelet Transform. Most data-analysis methods
assume that the data are linear and stationary; techniques like Wavelet analysis work for
non-stationary but linear data. All these techniques face one or more of the following problems:
windowing effect, low time resolution, low frequency resolution and fixed basis functions.
Speech is usually both non-linear and non-stationary. Hence, we need adaptive basis functions
for speech.
The Hilbert Huang Transform (HHT) is a time-frequency representation technique developed by
N. E. Huang and his group [1]. It is an empirical data-analysis method capable of processing
non-linear and non-stationary data, and it provides the ingredients needed to obtain adaptive
bases for the representation. The HHT has been reported to give sharper results than other
conventional time-frequency-energy representation methods for various problems, and to reveal
physically meaningful structure in many of the data examined.
The most important component of the HHT is Empirical Mode Decomposition (EMD). It separates
components with different frequency modes and produces a sensible representation even from
non-linear and non-stationary data.
After Empirical Mode Decomposition, the next component is Hilbert Spectral Analysis, which
provides the frequency-time representation of the non-linear and non-stationary input signal.
The Hilbert Huang Transform (or, more precisely, the Empirical Mode Decomposition), in a weak
sense, captures the highest frequency component at each iteration, producing IMFs that have
different frequency modes at a particular time instant. However, a particular IMF may have
different frequency modes at different time instants. This phenomenon is called 'mode-mixing'
[2]. An example is given in Figure 6. Mode-mixing does not affect Hilbert Spectral Analysis,
because in the final time-frequency representation all the IMFs are mapped to a single graph.
But if we want to use the IMFs individually and ensure that each IMF has a single frequency
mode, as in the case of pitch extraction, we need to get rid of mode-mixing. In this report, we
explain various techniques for improving the decomposition of a speech signal into Intrinsic
Mode Functions and removing mode-mixing in the Hilbert Huang Transform, and we propose a new
method for pitch extraction using the Hilbert Huang Transform.
3. My contributions
• Problem formulation
• Literature survey on the Hilbert Huang Transform and Empirical Mode Decomposition
• Writing the code for Empirical Mode Decomposition and the Hilbert Huang Transform in MATLAB
• Testing a set of synthetic signals to understand the features of Empirical Mode
Decomposition and the Hilbert Huang Transform
• Exploring the applications of the Hilbert Huang Transform to speech signals
• Trying out noise-assisted data analysis (Ensemble Empirical Mode Decomposition) to get rid
of mode-mixing, one of the shortcomings of the Hilbert Huang Transform, and writing code for it
• Writing code for the Neighbourhood Limited Empirical Mode Decomposition algorithm
• Literature survey on pitch extraction algorithms
• Derivation of the algorithm for pitch extraction using Empirical Mode Decomposition
• Writing the code for pitch extraction using Empirical Mode Decomposition
• Locating epochs using Empirical Mode Decomposition and comparing them with the epochs
derived from the Zero Frequency Filter method [5]
• Optimising Ensemble Empirical Mode Decomposition to reduce time complexity, and testing
this method on speech signals and the corresponding Intrinsic Mode Functions
• Applying Neighbourhood Limited Empirical Mode Decomposition to synthetic and speech
signals, and designing novel variants of it for the specific problem at hand: pitch extraction
• Developing a novel combination of filtering and Neighbourhood Limited Empirical Mode
Decomposition for the extraction of pitch, to increase efficiency
• Analysing the pitch and the epochs obtained using the modified filtering, the novel variant
of Neighbourhood Limited Empirical Mode Decomposition and Hilbert Spectral Analysis
• Exploring the use of features of the modified Intrinsic Mode Function for various purposes,
including “Digit Recognition”
• Comparing the proposed novel epoch detection algorithm with the Zero-Frequency Filter
method on a dataset containing about 5000 epochs determined by the Zero-Frequency Filter
method, and analysing the results
4. Empirical Mode Decomposition
The fundamental part of the Hilbert Huang Transform is the Empirical Mode Decomposition
(EMD) method. Using the EMD method, a signal can be decomposed into a finite and often small
number of frequency modes called Intrinsic Mode Functions (IMFs). An IMF represents a simple
frequency mode similar to a simple harmonic function, but it is much more general: unlike a
simple harmonic component, an IMF has amplitude and frequency that are functions of time.
The algorithm can be described as:
Step 1: Initially set $z_0 = x(t)$ and $i = 0$ ($x(t)$ is the speech signal).
Step 2: To find the (i+1)th Intrinsic Mode Function:
(a) Initially set $h_i^{(0)} = z_i$ and $k = 1$;
(b) Find the local extrema of $h_i^{(k-1)}$;
(c) Construct the maxima envelope and the minima envelope of $h_i^{(k-1)}$ by interpolation;
(d) Calculate the mean envelope $m_i^{(k-1)}$ from the maxima envelope and the minima
envelope;
(e) Subtract the mean envelope from the input signal of this stage:
$h_i^{(k)} = h_i^{(k-1)} - m_i^{(k-1)}$
(f) Check whether $h_i^{(k)}$ satisfies the properties of an Intrinsic Mode Function.
If not, increment k and go back to (b).
Step 3: Define $z_{i+1} = z_i - h_i^{(k)}$.
Step 4: Check whether the required precision is obtained. If not, increment i and go back to Step 2.
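As an illustration of steps (b) to (e), the following Python sketch performs a single sifting iteration. It is not the MATLAB code written for this project: the function names are hypothetical, and piecewise-linear envelopes are assumed in place of spline interpolation to keep the example short.

```python
def local_extrema(x):
    """Return index lists (maxima, minima) of interior local extrema."""
    maxima, minima = [], []
    for n in range(1, len(x) - 1):
        if x[n - 1] < x[n] >= x[n + 1]:
            maxima.append(n)
        elif x[n - 1] > x[n] <= x[n + 1]:
            minima.append(n)
    return maxima, minima


def envelope(x, idx):
    """Piecewise-linear envelope through the points (i, x[i]) for i in idx.

    Assumption: linear interpolation stands in for a cubic spline; the
    value is held constant before the first and after the last extremum.
    """
    env = [0.0] * len(x)
    for n in range(len(x)):
        if n <= idx[0]:
            env[n] = x[idx[0]]
        elif n >= idx[-1]:
            env[n] = x[idx[-1]]
    for a, b in zip(idx, idx[1:]):
        for n in range(a, b + 1):
            w = (n - a) / (b - a)
            env[n] = (1 - w) * x[a] + w * x[b]
    return env


def sift_once(h):
    """One sifting iteration: subtract the mean envelope from h."""
    maxima, minima = local_extrema(h)
    if len(maxima) < 2 or len(minima) < 2:
        return None  # too few extrema: h behaves like a residue
    upper = envelope(h, maxima)
    lower = envelope(h, minima)
    return [v - 0.5 * (u + l) for v, u, l in zip(h, upper, lower)]
```

For a fast sinusoid riding on a slow linear trend, one call to `sift_once` already removes the trend between the first and last extrema; near the ends of the signal the held envelope values make the result less reliable.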
The procedure followed to compute the Intrinsic Mode Functions is:
1) Initially, the maxima and minima of the speech signal are computed.
Fig. 1: Speech Signal along with Maxima and Minima
2) Using a curve-fitting technique, in this case the built-in spline function in MATLAB, a
maxima envelope is generated from the set of maxima and a minima envelope is generated from
the set of minima.
3) A mean envelope (in magenta) is generated, which is the mean of the maxima envelope (in
red) and the minima envelope (in green) computed at each sampling point.
Fig. 2: Speech Signal along with maximum, minimum and mean-envelope
4) The mean envelope is subtracted from the speech signal s(t) to get a new signal
$h_i^{(k)}(t)$, which is a potential IMF.
5) Check whether $h_i^{(k)}(t)$ satisfies the properties required of an IMF:
a) The number of extrema and the number of zero crossings differ by at most one.
b) The mean value of the envelope (steps 1, 2 and 3 applied to $h_i^{(k)}(t)$) is zero at
all points.
6) Also check whether $h_i^{(k)}(t)$ satisfies the Standard Deviation criterion, where the
Standard Deviation between two successive sifting results is defined by
\[ SD = \sum_{t} \frac{\left| h_1^{(k-1)}(t) - h_1^{(k)}(t) \right|^2}{\left( h_1^{(k-1)}(t) \right)^2} \]
Here, $h_1^{(0)}(t) = s(t)$. As the number of iterations increases, $h_1^{(k-1)}(t)$ is the
potential IMF in iteration k-1 and $h_1^{(k)}(t)$ is the potential IMF in iteration k.
The Standard Deviation threshold is usually set between 0.2 and 0.3.
7) If $h_i^{(k)}(t)$ satisfies the criteria of steps 5 and 6, $c_1(t) = h_i^{(k)}(t)$ is
declared to be the IMF (the 1st IMF in this case). This IMF is subtracted from the original
speech signal s(t) to get $s_1(t)$. Steps 1 to 7 are repeated with $s_1(t)$ as the input.
The process stops when the residual signal is a monotonic function or a constant.
8) So, for each input signal, we get a set of IMFs and possibly a residue, which is a constant
or a monotonic signal.
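The checks in steps 5(a) and 6 can be sketched in Python as follows. The function names are hypothetical, the default threshold of 0.3 is taken from the range quoted above, and the small constant `eps` is an added assumption that guards against division by zero; condition 5(b) is not tested here.

```python
def count_zero_crossings(h):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(h, h[1:]) if a * b < 0)


def count_extrema(h):
    """Number of strict interior local maxima and minima."""
    return sum(
        1
        for a, b, c in zip(h, h[1:], h[2:])
        if (a < b > c) or (a > b < c)
    )


def sifting_sd(h_prev, h_curr, eps=1e-12):
    """Standard Deviation between successive sifting results.

    eps is not part of the criterion; it only avoids dividing by zero.
    """
    return sum((p - c) ** 2 / (p * p + eps) for p, c in zip(h_prev, h_curr))


def looks_like_imf(h_prev, h_curr, sd_threshold=0.3):
    """True if the counts differ by at most one and the SD is below threshold."""
    counts_ok = abs(count_extrema(h_curr) - count_zero_crossings(h_curr)) <= 1
    return counts_ok and sifting_sd(h_prev, h_curr) < sd_threshold
```

A sinusoid that barely changes between two iterations passes both checks, while the same sinusoid with a constant offset fails the zero-crossing condition.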
Now, let us compare the features of three different types of frequency representation techniques:
                 Fourier Transform   Wavelet Transform       Hilbert-Huang Transform
Basis            a priori            a priori                adaptive
Frequency        Not local           Not local               Local, instantaneous
Presentation     Energy-Frequency    Energy-Time-Frequency   Energy-Time-Frequency
Non-linear       No                  No                      Yes
Non-stationary   No                  Yes                     Yes
Table 1: Comparison of Frequency Representation techniques
For example, consider x=1:10000; y=sin(x*0.01)+sin(x*0.1); The signal is plotted along with its
two IMFs.
Fig. 3: A synthetic signal with two sinusoids and the corresponding IMFs
Fig. 4: A synthetic signal having two sinusoids of different lengths. The highest
frequency sinusoid persists for a longer duration. The second and third subplots show the
Intrinsic Mode Functions obtained using Empirical Mode Decomposition.
In this case, EMD separates the frequency components well. In the general case, however, it
faces the problem of “Mode-mixing”, in which a single frequency component is present in more
than one IMF.
5. Mode-Mixing
Mode-mixing is a phenomenon in which a single frequency mode is present in more than one
Intrinsic Mode Function. It occurs because Empirical Mode Decomposition picks up the highest
frequency component at each point. The figure below depicts this.
Figure 5. A speech signal and the corresponding Intrinsic Mode Functions depicting the
Mode-mixing phenomenon, specifically in IMF4 and IMF5
Now, let us observe the magnified view of the speech signal, the 4th Intrinsic Mode Function
and the 5th Intrinsic Mode Function.
Figure 6. The Speech signal and its Fourth and Fifth Intrinsic Mode Functions
The portion highlighted in green has the same characteristics, yet it is spread over more than
one Intrinsic Mode Function. We want it to be present in a single Intrinsic Mode Function.
This problem arises when the Intrinsic Mode Functions are processed individually. When the
application requires the whole time-frequency representation, mode-mixing does not matter,
because the frequency components remain intact. It is therefore not a shortcoming of the
Hilbert Huang Transform itself but a limitation of its representation. In this specific case
we are searching for a particular IMF, so we need to tackle the problem of mode-mixing.
6. Ensemble Empirical Mode Decomposition
Ensemble Empirical Mode Decomposition (EEMD) [3] is a noise-assisted data analysis method
designed to take care of mode-mixing. White Gaussian noise is added to the input speech signal
and the result is decomposed using EMD. The same experiment is repeated N (>>1) times using N
different noise sequences, and the corresponding IMFs from these N experiments are averaged.
Because the noise sequences are random, they largely cancel in the average and become
negligible compared to the signal. Ideally, only the signal component remains. We can thus
avoid mode-mixing in Empirical Mode Decomposition.
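The ensemble averaging described above can be sketched in Python as follows. This is only an outline under stated assumptions: `decompose` stands for any EMD routine supplied by the caller (it is a hypothetical parameter, not a library function), and the values of `n_trials`, `noise_std` and `seed` are illustrative.

```python
import random


def eemd(signal, decompose, n_trials=100, noise_std=0.1, seed=0):
    """Average the IMFs of `decompose` over noise-perturbed copies of `signal`.

    `decompose` must return a list of IMFs, each as long as the signal.
    Trials may yield different numbers of IMFs; only the leading IMFs
    common to all trials are kept.
    """
    rng = random.Random(seed)
    sums = None
    for _ in range(n_trials):
        # Each trial uses a fresh white Gaussian noise sequence.
        noisy = [s + rng.gauss(0.0, noise_std) for s in signal]
        imfs = decompose(noisy)
        if sums is None:
            sums = [list(imf) for imf in imfs]
        else:
            sums = sums[: len(imfs)]
            for acc, imf in zip(sums, imfs):
                for n, v in enumerate(imf):
                    acc[n] += v
    return [[v / n_trials for v in acc] for acc in sums]
```

With a trivial `decompose` that returns its input unchanged, the output approaches the clean signal as `n_trials` grows, since the residual noise shrinks roughly as `noise_std / sqrt(n_trials)`.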
Figure 7. Input Signal: Concatenation of two sinusoids
Figure 8. EMD: Single IMF contains two modes
Figure 9. EEMD: Highest Frequency IMF
Figure 10. EEMD: Lowest Frequency IMF
The EEMD algorithm [8] removes mode-mixing by defining the true IMF components as the mean of
an ensemble of trials, each obtained by adding noise of finite variance to the input signal.
Although the autocorrelation-based method has been reported to be the best pitch estimation
technique for the analysis of the pathological sustained vowel /a/, it has been shown [9] that
it fails, along with the RAPT algorithm, while the method using Ensemble Empirical Mode
Decomposition gives better results.
In the figures above, the first plot is a synthetic signal formed by concatenating two
sinusoids. With Empirical Mode Decomposition, which captures the highest frequency component at
each instant of time, a single Intrinsic Mode Function contains both sinusoids. With Ensemble
Empirical Mode Decomposition, we get two different Intrinsic Mode Functions containing one
sinusoid each, as desired, so mode-mixing is handled well. We need a particular Intrinsic Mode
Function to contain a specific frequency component, and this can be achieved using EEMD.
However, EEMD has a few limitations. The EEMD Intrinsic Mode Functions do not satisfy the
defining properties of IMFs unless N, the number of trials, is very high. To reduce the
variance of the estimate, it must be averaged over a number of estimates [7]. Also, the
residual noise becomes small only when N is high. We found that the higher the value of N (of
the order of thousands), the better the output. But this leads to an N-fold increase in
resource consumption (mainly time). In addition, because of the added noise, the execution of a
single trial also becomes costlier. Hence, the run-time is very high. We then experimented with
different noise levels and found that if the noise energy is very low, EEMD behaves like
ordinary Empirical Mode Decomposition and does little to avoid mode-mixing. So, we need an
alternative way to tackle mode-mixing.
7. Neighbourhood Limited Empirical Mode Decomposition
Neighbourhood Limited Empirical Mode Decomposition (NLEMD) [4] is another method to avoid
mode-mixing in Empirical Mode Decomposition. Frequency mixing usually occurs in the EMD
algorithm described above because the highest frequency component is picked in each locality.
So, we need to restrict the frequency span of a particular IMF. This can be done by limiting
the distance between two consecutive extrema: a spurious extremum is added whenever the
distance between two consecutive extrema is large. This limits the frequency within a
particular frequency mode while still allowing its amplitude to vary. The frequencies are thus
separated into their corresponding frequency modes.
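The idea of limiting the distance between consecutive extrema can be illustrated with a short Python sketch. The details are assumptions made for illustration, not the procedure of [4]: the function name and the `max_gap` parameter are hypothetical, and the inserted points are simply spread evenly across each over-long gap.

```python
def limit_extrema_spacing(extrema, max_gap):
    """Insert extra sample indices so consecutive extrema are <= max_gap apart.

    `extrema` is a sorted list of sample indices; `max_gap` is a positive
    integer number of samples.
    """
    if not extrema:
        return []
    out = [extrema[0]]
    for nxt in extrema[1:]:
        gap = nxt - out[-1]
        if gap > max_gap:
            pieces = -(-gap // max_gap)  # ceil(gap / max_gap)
            start = out[-1]
            out.extend(start + round(j * gap / pieces) for j in range(1, pieces))
        out.append(nxt)
    return out
```

For example, `limit_extrema_spacing([0, 10, 50, 55], 20)` adds one index at 30. How the value at an inserted index is chosen (for instance, the signal sample at that index) is left open here.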
Figure 11. Input Signal: x=1:1000; y=[sin(0.1*x) sin(0.8*x)];
Figure 12. EMD: Single IMF contains 2 modes
Figure 13. NLEMD: Highest Frequency IMF
Figure 14. NLEMD: Lowest Frequency IMF
X-axis is time, Y-axis is Amplitude
If we apply strict Neighbourhood Limited Empirical Mode Decomposition before filtering the
signal, all the Intrinsic Mode Functions will contain only frequencies in the pitch region, and
we will not be able to decide the best candidate for pitch information using filtering. If we
filter the signals first and then apply the Neighbourhood Limited criterion, much of the pitch
information will leak into other Intrinsic Mode Functions and we will fail to tackle
mode-mixing.
The resource consumption of Neighbourhood Limited Empirical Mode Decomposition increases as the
entropy of the time-frequency representation increases. It is therefore lower for sinusoids,
higher for speech and even higher for noisy speech. To reduce the resources consumed, we can
begin applying the Neighbourhood Limited criterion at the Intrinsic Mode Function where the
pitch information first appears. This reduces resource consumption while leaving performance
almost intact. Another application of Neighbourhood Limited Empirical Mode Decomposition can be
“Speech Enhancement”, as the frequency resolution of the Hilbert Huang Transform is high.
Hence, we have designed an algorithm that captures the pitch information, where it exists, from
the adjacent Intrinsic Mode Functions, and captures other frequencies elsewhere. This ensures
that a particular Intrinsic Mode Function contains the pitch information wherever pitch
information exists and has different frequencies in other regions. So, we first perform the
modified novel Neighbourhood Limited Empirical Mode Decomposition and then proceed with the
filtering.
8. Pitch Extraction using Empirical Mode Decomposition
The pitch is a prominent characteristic of speech (and of the speaker). It is used extensively
in Speech Coding, Speaker Recognition and a few Speech Recognition systems. Many techniques
exist to calculate the pitch, such as the autocorrelation method, cepstral methods and AMDF.
But all of these techniques face some or all of the following problems: windowing effect, low
time resolution and low frequency resolution.
This research work is an attempt to get rid of some or all of these shortcomings. We can use
Empirical Mode Decomposition to find the instantaneous pitch. The idea is that one of the
Intrinsic Mode Functions contains the pitch information. To make sure that a unique Intrinsic
Mode Function contains the pitch information, we need to get rid of mode-mixing. We use
Neighbourhood Limited Empirical Mode Decomposition for this purpose and have observed that it
serves the purpose. We also need to filter the signals. The main purpose of this filtering is
to determine the best candidate for the pitch information. The Intrinsic Mode Functions are
filtered, and the ratio of the energy of the filter output to the energy of the filter input is
used to determine the best candidate. The best candidate passes through the filter to the
greatest extent, while the other Intrinsic Mode Functions are highly attenuated, as is also
clear from the given figures.
The filtering also helps to determine the voiced portions. The short-time ratio of the energy
of the filter output to the energy of the filter input can be used as a parameter that
indicates the voicedness of the speech. Only in voiced regions does most of the signal energy
pass through the filter; the filter attenuates the signal in other regions. This ratio can also
be used as a feature in tasks such as “Digit Recognition” and “Speech Recognition”, among
others.
So, initially, all the IMFs are computed using Neighbourhood Limited Empirical Mode
Decomposition. Then, a rough estimate of the pitch is obtained using the autocorrelation
method. A narrow-band band-pass filter is designed with the estimated pitch as its centre
frequency. The IMFs are normalized so that they can later be compared. All the IMFs are then
passed through the filter. The IMF with the highest energy after filtering is proposed as the
IMF containing the pitch information.
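The selection rule described above can be sketched in Python as follows. The filter is an assumption: a standard second-order band-pass section with unity gain at the centre frequency stands in for whatever filter was actually designed, and the names `bandpass`, `pick_pitch_imf` and the quality factor `q` are hypothetical. The centre frequency `f0` is taken as given, for example from an autocorrelation estimate.

```python
import math


def bandpass(x, f0, fs, q=5.0):
    """Second-order band-pass filter (0 dB peak gain) centred on f0 Hz."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b0, b2 = alpha, -alpha  # the b1 coefficient is zero for this filter
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = (b0 * xn + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def pick_pitch_imf(imfs, f0, fs):
    """Return (index, ratios): the IMF passing the largest energy fraction."""
    ratios = [energy(bandpass(imf, f0, fs)) / energy(imf) for imf in imfs]
    best = max(range(len(imfs)), key=lambda i: ratios[i])
    return best, ratios
```

Using the ratio of output to input energy has the same effect as normalizing the IMFs before filtering, and the ratios themselves indicate how confident the choice is.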
Figure 15. The speech signal and the corresponding IMFs
Figure 16. Filtered IMFs along with the speech signal
It can be observed that IMF 6 contains the pitch information and has the highest fraction of
energy passed through the filter.
Now, let us consider an example containing an unvoiced and a voiced sound (/s/ /e/). The
utterance contains two phonemes: the fricative ‘s’ and the vowel ‘e’. The speech signal and the
corresponding IMFs are plotted.
Figure 17. The speech signal and the corresponding IMFs determined by modified
Neighbourhood Limited Empirical Mode Decomposition
The amplitude of the Filtered IMF6 is high in the voiced region and close to zero in the
unvoiced part, so its envelope can be used to distinguish the two. A suitable threshold is
placed on the envelope to make the decision. The plots below show that the 6th Intrinsic Mode
Function contains the pitch information. This is more evident from the quantitative analysis
of the filtered Intrinsic Mode Functions. To find out which IMF contains the pitch information,
we pass all the IMFs through a suitable filter and determine the fraction of energy that passes
through. The IMF with the largest fraction is the best candidate for the pitch information.
These fractions also indicate the confidence in the chosen IMF: the fraction should be as large
as possible for the chosen IMF and as low as possible for the others. In this example, IMF6
ranks sixth among the IMFs with respect to amplitude or energy content, whereas Filtered IMF6
ranks first among the filtered IMFs. Also, Filtered IMF6 shows hardly any mode-mixing, i.e.
mixing of more than one frequency mode.
Figure 18. The speech signal and the corresponding Filtered IMFs obtained using EMD,
Neighbourhood Limited Criterion and Filtering
The plots of the speech signal, its Filtered IMF6 and the envelope of the magnitude of Filtered
IMF6 are shown for about 250 ms. We can observe that the local amplitude of the Intrinsic Mode
Function containing the pitch information represents the voicedness of the speech. We can use
it to determine whether a particular region is voiced or not. In this case, this is done by
placing a threshold on the maxima envelope of the Intrinsic Mode Function. The figure below
depicts this. The red line in the third subplot represents the threshold. The envelope can also
be used for various applications including, but not limited to, “Digit Recognition”.
Figure 19. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Envelope of Filtered
IMF6 and the Threshold in red
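A minimal Python sketch of such a threshold-based voicing decision is given below. It rests on assumptions: the envelope is approximated by a sliding maximum of the magnitude rather than an interpolated maxima envelope, and the function name, `threshold` and `window` values are hypothetical.

```python
def voiced_mask(filtered_imf, threshold, window=80):
    """Mark samples whose local peak magnitude exceeds `threshold`.

    The envelope is a sliding maximum of |x| over `window` samples on each
    side of the current sample, a simple stand-in for a maxima envelope.
    """
    mags = [abs(v) for v in filtered_imf]
    n = len(mags)
    mask = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask.append(max(mags[lo:hi]) > threshold)
    return mask
```

Choosing `window` close to the longest expected pitch period keeps the envelope from dipping between successive peaks inside a voiced region.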
We also tested the method for epoch determination. For this purpose, the zero crossings with
positive slope of the filtered IMF are located; these are taken as the epochs. The epochs
obtained using this method are compared with those obtained using the Zero Frequency Filter
method [5]. The average accuracy for a set of speech files containing about 5000 epochs is
96.43%, with a Standard Deviation of about 0.4 ms (sampling frequency Fs = 16 kHz).
Figure 20. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Envelope of Filtered
IMF6 and the Threshold in red, (d) Epochs determined by our novel HHT based method and (e)
Epochs determined by ZFF method
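Locating zero crossings with positive slope can be sketched in Python as follows. The function name is hypothetical, and the linear interpolation between the two samples around each crossing is an added refinement, not something stated in the report.

```python
def positive_zero_crossings(x, fs):
    """Times (in seconds) where x crosses zero going upwards."""
    epochs = []
    for n in range(len(x) - 1):
        if x[n] < 0.0 <= x[n + 1]:
            # Linear interpolation between samples n and n + 1.
            frac = -x[n] / (x[n + 1] - x[n])
            epochs.append((n + frac) / fs)
    return epochs
```

The spacing between consecutive returned times is the local pitch period, so its reciprocal gives a pitch value per epoch.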
The pitch is obtained from the filtered Intrinsic Mode Functions using Hilbert Spectral
Analysis. First, the Hilbert Transform of the signal is computed. Then, the original signal,
i.e. the filtered Intrinsic Mode Function, is used as the real part of a complex signal and its
Hilbert Transform as the imaginary part. The frequency at each sampling point is obtained by
scaling the derivative of the instantaneous phase of this complex signal. The instantaneous
frequency (as a frequency-time representation) is obtained for two speech signal portions and
is plotted along with the Epochs in the following figures. The Epochs obtained by the
Zero-Frequency Filtering [6] method are also plotted.
Figure 21. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Time-frequency
representation of IMF6, (d) Epochs determined by our novel HHT based method and (e) Epochs
determined by ZFF method
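The Hilbert spectral analysis step can be illustrated with the following Python sketch. It is not the implementation used for this report: the function names are hypothetical, a direct O(N²) DFT is used only to keep the example dependency-free, and the test signal is a short synthetic tone at a low sampling rate. In practice an FFT-based routine such as `scipy.signal.hilbert` would normally be used instead.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct discrete Fourier transform (slow; adequate for a short example)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * math.pi * ((k * m) % n) / n)
                for m in range(n))
        out.append(s / n if inverse else s)
    return out

def analytic_signal(x):
    """Complex signal with x as real part and its Hilbert transform as imaginary part."""
    n = len(x)
    weights = [0.0] * n          # zero the negative-frequency bins
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
    for k in range(1, (n + 1) // 2):
        weights[k] = 2.0         # double the positive-frequency bins
    return dft([s * w for s, w in zip(dft(x), weights)], inverse=True)

def instantaneous_frequency(x, fs):
    """Frequency in Hz from the sample-to-sample phase increment of the analytic signal."""
    z = analytic_signal(x)
    return [cmath.phase(b * a.conjugate()) * fs / (2 * math.pi)
            for a, b in zip(z, z[1:])]

# Synthetic mono-component test signal: 125 Hz tone, 256 samples at Fs = 2 kHz.
fs = 2000
tone = [math.cos(2 * math.pi * 125 * i / fs) for i in range(256)]
freqs = instantaneous_frequency(tone, fs)
```

Using the phase of `b * conj(a)` gives the phase increment between consecutive samples directly, which avoids a separate phase-unwrapping step. The estimate is meaningful only when the input contains a single frequency mode, which is why mode-mixing has to be removed before this step.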
We can observe that, as desired, each epoch determined by the Zero-Frequency Filtering method
has a corresponding epoch determined by our novel algorithm based on modified
Neighbourhood Limited Empirical Mode Decomposition. Also, we can observe that
the distance between adjacent epochs determined by the Zero-Frequency Filtering method
closely matches the distance between the corresponding epochs determined by our novel method.
Figure 22. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Time-frequency
representation of Filtered IMF6, (d) Epochs determined by our novel HHT based method and
(e) Epochs determined by ZFF method
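One way to quantify the agreement between two sets of epochs is sketched below in Python. The report does not state exactly how the accuracy figure was computed, so the matching rule (nearest detected epoch within a tolerance), the function name and the numbers in the example are all assumptions made for illustration.

```python
import statistics

def compare_epochs(reference, detected, tol):
    """Match each reference epoch to its nearest detected epoch.

    Returns the fraction of reference epochs matched within tol (seconds)
    and the standard deviation of the timing errors of the matched epochs.
    """
    errors = []
    for r in reference:
        nearest = min(detected, key=lambda d: abs(d - r))
        if abs(nearest - r) <= tol:
            errors.append(nearest - r)
    hit_rate = len(errors) / len(reference)
    spread = statistics.pstdev(errors) if errors else float("nan")
    return hit_rate, spread

# Made-up epoch times in seconds, purely to exercise the function.
zff_epochs = [0.010, 0.020, 0.030, 0.040]
hht_epochs = [0.0102, 0.0198, 0.0301, 0.0460]
hit_rate, spread = compare_epochs(zff_epochs, hht_epochs, tol=0.001)
```

In this toy example three of the four reference epochs have a detected epoch within 1 ms, so `hit_rate` is 0.75, and `spread` is the standard deviation of the three timing errors.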
9. Potential Applications
• The extracted pitch can be used as a feature in speaker recognition.
• It can also be used as one of the parameters in speech coding.
• Other applications include biometric identification, biomedical applications, signal
processing research, etc.
• This will help future research in signal processing, especially in the field of spectral
representation of signals.
• Other products of this algorithm are very useful in many cases. An example is digit
recognition, where the envelope of the suitable IMF increases the
accuracy.
10. Results and Conclusions
In this report we have explained and tested various methods to enhance the decomposition of
speech signals into Intrinsic Mode Functions (IMFs) and to extract pitch using the Hilbert Huang
Transform. We can conclude that Empirical Mode Decomposition is a good method for
decomposing a speech signal into IMFs. However, mode-mixing violates the assumption that an IMF
contains a single frequency mode, so it must be avoided in applications which rely
on that assumption. Hence, we introduced Ensemble Empirical
Mode Decomposition, which avoids mode-mixing using the principle of noise-assisted
data analysis. However, the time required by Ensemble Empirical Mode Decomposition is very high:
N times that of normal EMD, where N is of the order of a few thousand. Hence, we need an
alternative method that avoids mode-mixing without affecting the other properties of Empirical
Mode Decomposition. We found that Neighbourhood Limited Empirical Mode Decomposition
combined with filtering meets our expectations. It effectively gets rid of mode-mixing in
Empirical Mode Decomposition and has a runtime comparable to that of Empirical Mode
Decomposition. We have obtained very good results using this novel algorithm: the epoch detection
accuracy for a set of speech files containing about 5000 epochs is 96.43%, with a standard
deviation of about 0.4 ms. Also, other products of this algorithm, such as the IMF envelope, have
a variety of applications.
11. Future Work
1) To explore the uses of instantaneous pitch
2) To exploit the algorithm for other signals including Biomedical signals
3) To test the algorithm on a larger dataset
4) To increase the efficiency of the algorithm with respect to run-time (for example, by
implementing the algorithm in C or C++)
5) To test the novel algorithm for IMF analysis in various applications
12. References
[1] Norden E. Huang, Zheng Shen, Steven R. Long, Manli C. Wu, Hsing H. Shih, Quanan
Zheng, Nai-Chyuan Yen, Chi Chao Tung and Henry H. Liu, "The empirical mode
decomposition and the Hilbert spectrum for nonlinear and non-stationary time series
analysis," Proc. Royal Society London A, vol. 454, pp. 903-995, 1998.
[2] Norden E. Huang and Samuel S. P. Shen, Hilbert-Huang Transform and Its Applications,
London: World Scientific, 2005.
[3] G. Schlotthauer, M. E. Torres and H. L. Rufiner, "A new algorithm for instantaneous F0
speech extraction based on ensemble empirical mode decomposition," Proc. European
Signal Processing Conference, Glasgow, Scotland, August 24-28, 2009.
[4] Guanlei Xu, Xiaotong Wang and Xiaogang Xu, "Neighborhood Limited Empirical Mode
Decomposition and application in Image Processing," Proc. Fourth International
Conference on Image and Graphics, pp. 149-154, 2007.
[5] K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE
Trans. Audio, Speech and Language Process., vol. 16, no. 8, pp. 1602–1614, Nov. 2008.
[6] S. R. M. Prasanna, D. Govind, K. Sreenivasa Rao and B. Yegnanarayana, "Fast prosody
modification using instants of significant excitation," Proc. Speech Prosody, Chicago,
USA, May 2010.
[7] R. M. Rangayyan, Biomedical Signal Analysis: A Case-Study Approach, IEEE and
Wiley, New York, p. 289, 2002.
[8] Z. Wu and N. E. Huang, "Ensemble empirical mode decomposition: a noise-assisted data
analysis method," Advances in Adaptive Data Analysis, vol. 1, no. 1, pp. 1-41, 2009.
[9] G. Schlotthauer, M. E. Torres and H. L. Rufiner, "Voice fundamental frequency
extraction algorithm based on ensemble empirical mode decomposition and entropies,"
Proc. 11th Int. Congr. of the IFMBE, Munich, pp. 984–987, 2009.