
PITCH DETECTION ALGORITHM: AUTOCORRELATION METHOD AND AMDF

Li Tan and Montri Karnjanadecha

Department of Computer Engineering


Faculty of Engineering
Prince of Songkhla University
Hat Yai, Songkhla
Thailand, 90112
E-mail: [email protected], [email protected]

ABSTRACT

This paper describes pitch tracking techniques using the autocorrelation method and the AMDF (Average Magnitude Difference Function) method, covering the preprocessing and the extraction of the pitch pattern. It also presents the implementation, basic experiments, and discussion.

KEYWORDS
Pitch, Pitch Detection Algorithm, Autocorrelation Function, Speech Recognition System, Center-Clipping, Pitch Contour

1. INTRODUCTION

Pitch detection is very important for many speech processing algorithms. Speech recognition systems for tonal languages use pitch tracking for tone recognition, which is important in disambiguating the myriad of homophones. Pitch is also crucial for prosodic variation in text-to-speech systems and spoken language systems. The fundamental frequency (F0) is the main cue for pitch. However, it is difficult to build reliable statistical models involving F0 because of pitch estimation errors and the discontinuity of the F0 space. Thus, a reliable pitch detection algorithm (PDA) is a very important component in many speech processing systems.

In this paper, the principles of the two pitch detection algorithms, the preprocessing, and the pitch pattern extraction techniques are introduced, and their implementation is described. The experiments and discussion are then presented, followed by the conclusions.

2. BACKGROUND

2.1 Autocorrelation Method and AMDF

Basically, pitch detection algorithms use short-term analysis techniques. For every frame x_m we get a score f(T | x_m) that is a function of the candidate pitch periods T. The algorithm determines the optimal pitch by maximizing (1):

    T_m = \arg\max_T f(T \mid x_m)    (1)

A commonly used method to estimate pitch is based on detecting the highest value of the autocorrelation function in the region of interest. Given a discrete-time signal x(n), defined for all n, the autocorrelation function is generally defined as in (2):

    R_x(m) = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x(n)\, x(n+m)    (2)

The autocorrelation function of a signal is basically a (non-invertible) transformation of the signal that is useful for displaying structure in the waveform. Thus, for pitch detection, if we assume x(n) is exactly periodic with period P, i.e., x(n) = x(n + P) for all n, then it is easily shown that

    R_x(m) = R_x(m + P),    (3)

i.e., the autocorrelation is also periodic with the same period. Conversely, periodicity in the autocorrelation function indicates periodicity in the signal.

For a nonstationary signal, such as speech, the concept of a long-time autocorrelation measurement as given by (2) is not really meaningful. Thus, it is reasonable to define a short-time autocorrelation function, which operates on short segments of the signal:

    R_x(m) = \frac{1}{N} \sum_{n=0}^{N'-1} [x(n+l)\, w(n)]\, [x(n+l+m)\, w(n+m)], \quad 0 \le m \le M_0    (4)

where w(n) is an appropriate analysis window, N is the section length being analyzed, N' is the number of signal samples used in the computation of R_x(m), M_0 is the number of autocorrelation points to be computed, and l is the index of the starting sample of the frame. For pitch detection applications, N' is generally set to the value in (5):

    N' = N - m    (5)

so that only the N samples in the analysis frame (i.e., x(l), x(l+1), ..., x(l + N - 1)) are used in the autocorrelation computation. Values of 200 and 300 have generally been used for M_0 and N, respectively, corresponding to a maximum pitch period of 20 ms (200 samples at a 10 kHz sampling rate) and a 30 ms analysis frame size [1,3].
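To make the computation in (4)-(5) concrete, a minimal numpy sketch of the short-time autocorrelation for one analysis frame might look as follows. The function name and arguments are our own illustration, the window defaults to rectangular, and the 1/N scaling follows (4):

```python
import numpy as np

def short_time_autocorr(x, l, N, M0, window=None):
    """Short-time autocorrelation of eq. (4), with N' = N - m as in eq. (5).

    x      : full signal (1-D array)
    l      : index of the first sample of the analysis frame
    N      : frame length in samples (e.g. 300 at a 10 kHz rate)
    M0     : number of autocorrelation lags to compute (e.g. 200)
    window : optional analysis window w(n) of length N (rectangular if None)
    """
    frame = x[l:l + N].astype(float)
    if window is not None:
        frame = frame * window            # x(n+l) w(n)
    R = np.zeros(M0 + 1)
    for m in range(M0 + 1):
        Np = N - m                        # eq. (5): use only samples in the frame
        R[m] = np.dot(frame[:Np], frame[m:m + Np]) / N
    return R
```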
A variation of autocorrelation analysis for measuring the periodicity of voiced speech uses the AMDF, defined by the relation in (6):

    D_m = \frac{1}{L} \sum_{n=1}^{L} |x(n) - x(n-m)|, \quad m = 0, 1, \dots, m_{max}    (6)

where x(n) are the samples of input speech and x(n-m) are the samples time-shifted by m samples. The vertical bars denote taking the magnitude of the difference x(n) - x(n-m). Thus a difference signal D_m is formed by delaying the input speech by various amounts, subtracting the delayed waveform from the original, and summing the magnitudes of the differences between sample values. The difference signal is always zero at delay m = 0, and is particularly small at delays corresponding to the pitch period of a voiced sound with a quasiperiodic structure.

The AMDF [4] is a variation of ACF (autocorrelation function) analysis [1] where, instead of correlating the input speech at various delays (where multiplications and summations are performed at each delay value), a difference signal is formed between the delayed speech and the original, and at each delay value the absolute magnitude is taken. Unlike the autocorrelation or cross-correlation function, however, the AMDF calculations require no multiplications, a desirable property for real-time applications.

For each value of delay, the computation is made over an integrating window of N samples. To generate the entire range of delays, the window is "cross-differenced" with the full analysis interval. An advantage of this method is that the relative sizes of the nulls tend to remain constant as a function of delay, because there is always full overlap of data between the two segments being cross-differenced.

In extractors of this type, the limiting factor on accuracy is the inability to completely separate the fine structure from the effects of the spectral envelope. For this reason, decision logic and prior knowledge of voicing are used along with the function itself to help make the pitch decision more reliable [1,4].
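A corresponding sketch of the AMDF in (6) is given below. It averages over however many sample pairs fit in the frame at each lag, which is one reasonable reading of the 1/L normalization; the names are again illustrative:

```python
import numpy as np

def amdf(frame, max_lag):
    """Average magnitude difference function of eq. (6) for one frame.

    Returns D[m] for m = 0..max_lag; D[0] is always zero and D has
    nulls (deep valleys) at multiples of the pitch period.
    """
    frame = np.asarray(frame, dtype=float)
    L = len(frame)
    D = np.zeros(max_lag + 1)
    for m in range(max_lag + 1):
        # mean |x(n) - x(n-m)| over the samples where both terms exist
        D[m] = np.mean(np.abs(frame[m:] - frame[:L - m]))
    return D
```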

2.2 Preprocessing Technique

From the above, we know the autocorrelation function and the AMDF can be used to detect the pitch. However, the speech signal contains very rich harmonic components. The minimum F0 is about 80 Hz and the maximum is about 500 Hz, with most values in the range of 100-200 Hz. Thus the signal may contain 30-40 harmonic components, and the F0 component is often not the strongest one, because the first formant usually lies between 300 and 1000 Hz. That is, the 2nd-8th harmonic components are usually stronger than the fundamental component. The rich harmonic components make pitch tracking very complex, typically causing harmonic and sub-harmonic errors. To improve reliability, some preprocessing of the signal is necessary.

Since the range of F0 is generally 80-500 Hz, frequency components well above 500 Hz are useless for pitch detection. Thus a low-pass filter with a cutoff somewhat above 500 Hz is useful in improving the performance of pitch detection; generally, we use a low-pass filter with a 900 Hz cutoff.

Also, to reduce the effects of the formant structure on the detailed shape of the short-time autocorrelation function, nonlinear processing is usually applied in pitch tracking:

    y(n) = C[x(n)]    (7)

One of the nonlinear techniques is center-clipping of speech, first introduced by M. M. Sondhi [2]. The relation between the input x(n) and the output y(n) is:

    y(n) = \mathrm{clc}[x(n)] =
    \begin{cases}
      x(n) - C_L, & x(n) \ge C_L \\
      0, & |x(n)| < C_L \\
      x(n) + C_L, & x(n) \le -C_L
    \end{cases}    (8)

Another nonlinear clipping, which we call infinite-peak-clipping, is described in (9):

    y(n) = \mathrm{sgn}[x(n)] =
    \begin{cases}
      1, & x(n) \ge C_L \\
      0, & |x(n)| < C_L \\
      -1, & x(n) \le -C_L
    \end{cases}    (9)

where C_L is the clipping threshold. Generally C_L is about 30% of the maximum magnitude of the signal, and in application C_L should be as high as possible. To get a high C_L, we can take the peak value of the first 1/3 and the last 1/3 of the signal and use the smaller of the two as the maximum magnitude; C_L is then set to 60-80% of this maximum magnitude.

The effect of center-clipping and infinite-peak-clipping is clearly shown in Fig. 1 (a, b, c). From Fig. 1(b), after center-clipping the autocorrelation retains only a few pulses, showing the reduction of the confusing secondary peaks. From Fig. 1(c), the first peak is very clear and the secondary peak value is reduced. All of this shows that center-clipping and infinite-clipping are effective in reducing the effects of the formant structure [2,3,5].

[Fig. 1: Autocorrelation of (a) x(n), (b) clc[x(n)], (c) sgn[x(n)] (adapted from [5])]

2.3 Post-processing

Generally, the pitch determination described above is still error-prone. Erroneous voiced/unvoiced decisions and inaccurate voiced pitch hypotheses can lead to noisy and undependable feature measurements, so a smoothing stage is necessary to improve the performance of the system. The most common smoothing techniques include the median filter, linear smoothing, and dynamic programming. Given the reliability of the pitch tracking algorithm, the median filter is generally used. The median filter uses a moving window of 2L+1 points: the value at point n is determined by the data from point n-L to point n+L, and the median of these 2L+1 points is chosen as the value at that point [3].
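A running median of this kind is a few lines of numpy. The sketch below leaves the first and last L points unfiltered, which is one common edge-handling choice; the paper does not specify one:

```python
import numpy as np

def median_smooth(pitch, L=2):
    """Median filter over a moving window of 2L+1 points.

    L=2 gives the 5-point median filter used in the implementation
    (Section 3.2); edge frames are passed through unchanged.
    """
    p = np.asarray(pitch, dtype=float)
    out = p.copy()
    for n in range(L, len(p) - L):
        out[n] = np.median(p[n - L:n + L + 1])
    return out
```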

2.4 Feature Extraction

After getting the smoothed pitch contour, we fit it with a 3rd-order polynomial using least-mean-square approximation, or project it onto a set of basis functions using orthogonal polynomial approximation.
The least-mean-square approximation expresses the approximating function as a weighted sum of observation terms:

    f_{LMS} = \sum_{i=1}^{N} a_i x_i    (10)

where f_{LMS} is the estimated function, the a_i are the weighting coefficients, and the x_i are the observation terms (x_1 = 1, x_2 = x, x_3 = x^2, x_4 = x^3). The goal is then to minimize the expected value of the approximation error e = f_{LMS} - f to obtain the weighting coefficients a_i. To minimize the expected value, we take its derivative and set it to zero; the coefficients are then calculated through equation (11):

    \sum_{j=1}^{N} E(x_i x_j)\, a_j = E(f x_i)    (11)
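For the cubic case x_1 = 1, ..., x_4 = x^3, solving the normal equations (11) amounts to an ordinary polynomial least-squares fit, so a sketch can simply lean on numpy's polyfit. The [0,1] time axis below matches the normalization used for the Legendre basis that follows:

```python
import numpy as np

def lms_cubic_fit(pitch):
    """Fit a 3rd-order polynomial to a pitch contour by least squares.

    Solving eq. (11) with x1=1, x2=x, x3=x^2, x4=x^3 is equivalent to
    an ordinary least-squares fit, so numpy's polyfit is used directly.
    Returns the coefficients and the fitted contour.
    """
    pitch = np.asarray(pitch, dtype=float)
    t = np.linspace(0.0, 1.0, len(pitch))   # contour axis normalized to [0, 1]
    coeffs = np.polyfit(t, pitch, deg=3)    # highest power first
    return coeffs, np.polyval(coeffs, t)
```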
Orthogonal polynomials are defined in terms of their behavior with respect to each other throughout some predetermined range of the independent variable. In the case of vectors, if the set is complete it is said to span a vector space, and any vector in that space can be expressed as a linear combination of the orthogonal basis vectors. The first four discrete Legendre polynomials can be chosen to represent the pitch contour. They are shown in equation (12):

    \Phi_0(\tfrac{i}{N}) = 1

    \Phi_1(\tfrac{i}{N}) = \left[\frac{12N}{N+2}\right]^{1/2} \left[\frac{i}{N} - \frac{1}{2}\right]

    \Phi_2(\tfrac{i}{N}) = \left[\frac{180N^3}{(N-1)(N+2)(N+3)}\right]^{1/2} \left[\left(\frac{i}{N}\right)^2 - \frac{i}{N} + \frac{N-1}{6N}\right]

    \Phi_3(\tfrac{i}{N}) = \left[\frac{2800N^5}{(N-1)(N-2)(N+2)(N+3)(N+4)}\right]^{1/2} \left[\left(\frac{i}{N}\right)^3 - \frac{3}{2}\left(\frac{i}{N}\right)^2 + \frac{6N^2 - 3N + 2}{10N^2}\,\frac{i}{N} - \frac{(N-1)(N-2)}{20N^2}\right]    (12)

These polynomials are normalized in length to [0,1], where i runs from 0 to N, N+1 is the length of the pitch contour, and N should be bigger than 3. Legendre polynomials are a kind of orthogonal polynomial with the simplest weight function, which is equal to 1. They are chosen to represent the pitch contour because they resemble the basic pitch contour patterns. A pitch contour segment f(i/N) can then be approximated as in (13):

    \hat{f}(\tfrac{i}{N}) = \sum_{j=0}^{3} a_j \,\Phi_j(\tfrac{i}{N}), \quad 0 \le i \le N    (13)

where

    a_j = \frac{1}{N+1} \sum_{i=0}^{N} f(\tfrac{i}{N})\,\Phi_j(\tfrac{i}{N})

The reconstructed pitch contour will not lose much information, since orthogonal polynomials up to degree three are used to fit it [7,8].
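Assuming the reconstruction of (12) above, the basis and the projection (13) might be coded as follows. With this normalization the a_j come out directly from the averaged inner product, and checking numerically that Phi @ Phi.T / (N+1) is close to the identity is a useful sanity test:

```python
import numpy as np

def legendre_basis(N):
    """The four discrete Legendre polynomials of eq. (12), sampled at
    i/N for i = 0..N (requires N > 3). Returns a (4, N+1) array."""
    t = np.arange(N + 1) / N
    phi0 = np.ones(N + 1)
    phi1 = np.sqrt(12.0 * N / (N + 2)) * (t - 0.5)
    phi2 = (np.sqrt(180.0 * N**3 / ((N - 1) * (N + 2) * (N + 3)))
            * (t**2 - t + (N - 1) / (6.0 * N)))
    phi3 = (np.sqrt(2800.0 * N**5 /
                    ((N - 1) * (N - 2) * (N + 2) * (N + 3) * (N + 4)))
            * (t**3 - 1.5 * t**2
               + (6.0 * N**2 - 3.0 * N + 2) / (10.0 * N**2) * t
               - (N - 1.0) * (N - 2.0) / (20.0 * N**2)))
    return np.vstack([phi0, phi1, phi2, phi3])

def legendre_features(pitch):
    """Coefficients a_j of eq. (13): a_j = (1/(N+1)) sum_i f(i/N) Phi_j(i/N)."""
    f = np.asarray(pitch, dtype=float)
    N = len(f) - 1
    return legendre_basis(N) @ f / (N + 1)
```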
3. IMPLEMENTATIONS

3.1 Modified Autocorrelation Method

According to the discussion above, a modified autocorrelation pitch detector based on center-clipping and infinite-clipping is used in our implementation. Fig. 2 shows a block diagram of the pitch detection algorithm. The method requires that the speech be low-pass filtered to 900 Hz. The low-pass filtered speech signal is digitized at a 10 kHz sampling rate and sectioned into overlapping 30-ms (300-sample) sections for processing. Since the pitch period computation is performed 100 times/s, i.e., every 10 ms, adjacent sections overlap by 20 ms, or 200 samples. The first stage of processing is the computation of a clipping threshold C_L for the current 30-ms section of speech. The clipping level is set at 68 percent of the smaller of the peak absolute sample values in the first and last 10-ms portions of the section. Following the determination of the clipping level, the 30-ms section of speech is center-clipped and then infinite-peak-clipped. After clipping, the autocorrelation function for the 30-ms section is computed over a range of lags from 20 to 160 samples (i.e., periods of 2-16 ms). Additionally, the autocorrelation at zero delay is computed for the voiced/unvoiced determination. The autocorrelation function is then searched for its maximum value. If the maximum exceeds 0.55 of the autocorrelation value at zero delay, the section is classified as voiced and the location of the maximum is taken as the pitch period; otherwise, the section is classified as unvoiced.

[Fig. 2: Block diagram of the pitch detection algorithm using the modified autocorrelation method: Segment → LPF (0-900 Hz) → Calculate C_L → Center-clipping and infinite-peak-clipping → Energy of center-clipped signal → Autocorrelation function → Compare the maximum correlation with 0.55*ENG → if less, pitch = 0; otherwise pitch = index of maximum autocorrelation → Pitch value]
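Combining the sketches from Section 2, one frame of this detector might look like the following. Note that center-clipping followed by infinite peak clipping reduces to the three-level signal of eq. (9), and the 900 Hz low-pass filtering is assumed to have been applied already:

```python
import numpy as np

def detect_pitch_frame(frame, fs=10000, min_lag=20, max_lag=160,
                       vuv_ratio=0.55):
    """One frame of the modified autocorrelation detector of Fig. 2.

    Reuses clipping_threshold, infinite_peak_clip and
    short_time_autocorr sketched earlier. Returns F0 in Hz for a
    voiced frame, 0.0 for an unvoiced one.
    """
    CL = clipping_threshold(frame)
    # center clip + infinite peak clip == 3-level signal of eq. (9)
    y = infinite_peak_clip(frame, CL)
    R = short_time_autocorr(y, 0, len(y), max_lag)
    energy = R[0]                       # autocorrelation at zero delay
    lag = min_lag + int(np.argmax(R[min_lag:max_lag + 1]))
    if energy > 0 and R[lag] > vuv_ratio * energy:
        return fs / lag                 # voiced: pitch period -> F0
    return 0.0                          # unvoiced
```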
3.2 AMDF

We implement only a coarse quantization, leaving the voiced/unvoiced detection and the decision logic as further work. Fig. 3 shows a block diagram of the AMDF pitch detector. The speech signal is initially sampled at 10 kHz. The signal then passes through a low-pass filter (0-900 Hz) and the first 20 samples are set to zero. The clipping threshold is then calculated and center-clipping is applied to the signal. The average magnitude difference function is then computed on the center-clipped speech signal at lags of 20-140 samples, using the signal from sample 20 to sample 160. The pitch period is identified as the value of the lag at which the minimum AMDF occurs. Thus a fairly coarse quantization is obtained for the pitch period.

[Fig. 3: Block diagram of the coarse pitch detection using AMDF: Segment → LPF (0-900 Hz) → Clipping threshold C_L → AMDF → Pitch = index of the AMDF minimum → Pitch value]

The 5-point median filter and the feature extraction using LMS and orthogonal polynomials are also implemented according to the introduction above.

4. EXPERIMENTS AND DISCUSSION

4.1 Experiment Setting

Our experiments mainly include two parts. The first part emphasizes the observation of the results of the two pitch detection algorithms and of the preprocessing effects, such as low-pass filtering and center-clipping. The voiced/unvoiced determination in the autocorrelation method is also tested. The speech used in our experiments is from a Thai continuous digit database. We ran these experiments on a number of utterances from the database, and almost all of them show similar results, so here we use only one continuous utterance with content "07229" and one isolated Mandarin word "hao(3)", which is considered more difficult for pitch tracking because of its large variation. The second part works on a small database based on 4-digit continuous Thai sentences. The sentences are chosen according to the general distribution of the 3 tones in Thai digits; it includes 14 sentences with 23 1st-tone, 10 2nd-tone and 23 4th-tone syllables. Since this is only for testing, we recorded the sound in an office environment at a 16 kHz sampling frequency, collecting 4 male speakers with 2 rounds per person, for a total of 112 utterances. All of the speech is hand-labeled with the WaveSurfer software. In our testing, we use the 1st-round speech of each person as the training set and the rest as the testing set. We use our implementation to detect the pitch contour and extract the pitch feature, then use a three-layer feedforward NN (neural network) with 4 inputs and 5 outputs as the framework. The hidden layer size and the number of training epochs are determined by testing.

4.2 Autocorrelation and AMDF on Continuous Speech

To observe the difference between the AMDF and autocorrelation methods, we test both of them on the Thai continuous digit utterance "07229", shown in Fig. 5; the resulting pitch is shown in Fig. 7. From the figure, the pitch information mainly lies in the voiced parts of the speech signal; in the silent parts the pitch estimate shows large variation, while in the voiced parts the pitch track is continuous and smooth. This confirms that the voiced/unvoiced decision is a very important part of pitch detection. Also, although the pitch track shown in Fig. 7 can describe the trend of the pitch, some error points still exist which need further processing, that is, smoothing. Fig. 7 shows the results for both the autocorrelation method and the AMDF; we can see that both methods give acceptable results.

[Fig. 5: Waveform of the Thai digit string "07229"]
[Fig. 6: Waveform of the Mandarin word "hao" with 3rd tone]
[Fig. 7: Pitch track using the autocorrelation method and AMDF (top: AMDF; bottom: autocorrelation)]

4.3 Voiced/Unvoiced Decision

In the implementation of the autocorrelation method, we use 0.55 of the frame energy as the threshold for the voiced/unvoiced decision. Fig. 8 shows the experimental results. From Fig. 8, the method can basically detect the voiced parts of the speech, although some decision logic needs to be further considered.

[Fig. 8: Voiced/unvoiced detection in the autocorrelation method (top: no V/UV; bottom: V/UV)]

4.4 Smoothing

From the figures above, we know that smoothing of the pitch contour is necessary once a raw pitch contour is obtained. Here we choose a single spoken word as the object. Generally, the pitch of the 3rd tone in Mandarin is more difficult than the other tones because of its large variation, so we choose the Mandarin word "hao" with 3rd tone for the experiment; the waveform is shown in Fig. 6. We use the median filter to smooth the pitch obtained with the autocorrelation method. Fig. 9 shows the effectiveness of the median filter; however, the median filter cannot remove several consecutive error points. Therefore the difference between two consecutive frames is also examined: if it is greater than a predetermined threshold, the value lying farther from the mean is treated as an error and modified.

[Fig. 9: Smoothing the pitch contour of "hao(3)" using the median filter]
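The jump check just described might be sketched as follows. The threshold value and the replacement rule (copying the neighboring value) are our assumptions; the paper only states that the value farther from the mean is modified:

```python
import numpy as np

def fix_jumps(pitch, max_jump=50.0):
    """Repair frame-to-frame jumps the median filter misses (Sec. 4.4).

    When consecutive frames differ by more than max_jump (Hz, an
    assumed value), the value farther from the mean of the voiced
    frames is treated as an error and replaced by its neighbor.
    """
    p = np.asarray(pitch, dtype=float).copy()
    voiced = p > 0
    mean = p[voiced].mean() if np.any(voiced) else 0.0
    for n in range(1, len(p)):
        if abs(p[n] - p[n - 1]) > max_jump:
            if abs(p[n] - mean) > abs(p[n - 1] - mean):
                p[n] = p[n - 1]
            else:
                p[n - 1] = p[n]
    return p
```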
4.5 Effects of Pre-processing

To observe the effects of the low-pass filter and center-clipping, we ran experiments on the utterance "hao(3)". The results are shown in Fig. 10 and Fig. 11. In Fig. 10, which uses the AMDF algorithm, we do not find a large effect of the preprocessing; we believe this is because the effect of the formant structure on the AMDF method is not large. The preprocessing does, however, reduce the data and thus increase the processing speed. The effects of the LPF in the autocorrelation method, by contrast, are quite clear, as shown in Fig. 11: the number of error points is reduced from 10 to 4 after adding the LPF processing.

[Fig. 10: Effects of the preprocessing technique using AMDF (top-left: LPF + clipping; top-right: no LPF + clipping; bottom-left: LPF, no clipping; bottom-right: no LPF, no clipping)]
[Fig. 11: Effects of the LPF in the autocorrelation method (top: LPF; bottom: no LPF)]

4.6 Feature Extraction

Pitch information mainly lies in the trend of the pitch contour. As introduced above, two methods, LMS and orthogonal polynomial approximation, are used to extract the pattern of the pitch contour. The experiment is shown in Fig. 12. According to the figure, both of them work well; which one gives better performance in a recognition system needs further research and experiments. Fig. 13 shows the shapes of the four discrete Legendre bases for the pitch contour space. From Fig. 13, we can see that the four discrete Legendre bases closely resemble the basic pitch contour patterns.

[Fig. 12: The extracted pitch pattern (top: original pitch; middle: LMS; bottom: Legendre polynomials)]
[Fig. 13: The four discrete Legendre bases (dot: base 1; *: base 2; ∆: base 3; o: base 4)]
4.7 Classification

This is the second part of our experiments. Feature extraction using the implementation described above and a classifier using the NN are employed. Based on our observations, we use the autocorrelation pitch detection and the orthogonal polynomial features in our testing. All feature vectors are normalized to lie between -1.0 and 1.0 using min-max normalization:

    \mathrm{norm}F_i = 2.0 \times \left[\frac{F_i - \min F_i}{\max F_i - \min F_i}\right] - 1.0    (14)
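Applied per feature dimension over the data set, (14) is essentially a one-liner; a sketch:

```python
import numpy as np

def minmax_normalize(features):
    """Eq. (14): rescale each feature dimension to [-1.0, 1.0].

    features: (num_samples, num_features) array; min and max are
    taken per feature over the whole set.
    """
    F = np.asarray(features, dtype=float)
    Fmin, Fmax = F.min(axis=0), F.max(axis=0)
    return 2.0 * (F - Fmin) / (Fmax - Fmin) - 1.0
```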
The total accuracy of the testing is 79.02% (177 of 224). The confusion matrix is shown in Table 1.

Table 1. Confusion matrix of tone classification for Thai digits

    Tone    1     2     4     Percent (%)
    1       69    4     19    75
    2       5     33    2     82.5
    4       15    2     75    81.52
    Total                     79.02

From Table 1, we can see that the loss of accuracy mainly lies in the confusion between the 1st tone and the 4th tone. The reason for this result may lie in the five-tone Thai contour shown in Fig. 14 and the effects of continuous speech.

[Fig. 14: Average F0 contours of the five Thai tones produced in isolation (adapted from [9])]

From Fig. 14, we can see that the initial levels of tone 1 and tone 4 are similar. Also, because of the coarticulation effects of continuous speech, the tone trend cannot reach the final level for tone 4, and tone 1 ends at a higher level than in the isolated case. In addition, only 4 features are used in the classification, so it is possible that the accuracy would improve if more features were added.
5. CONCLUSIONS

The work described here covers the two pitch detection algorithms and the related techniques, including preprocessing, post-processing, and pitch pattern extraction. From our observation of the experiments, we found that both the autocorrelation method and the AMDF algorithm can provide acceptable results. Observing the preprocessing in both techniques, we did not find a large effect of preprocessing on the AMDF, but an obvious effect of the low-pass filter is shown in the experiment using the autocorrelation method. At the same time, we tested the smoothing using the median filter and the voiced/unvoiced decision in the autocorrelation method; both showed positive results. Finally, we used two methods to extract the pitch pattern from the smoothed pitch contour. According to the experiment figures, both of them work quite well, although they require a well-smoothed pitch segment. The voiced/unvoiced determination and the segmentation of the pitch contour are further important issues for pitch detection that we have not discussed here; we will address them in our further work. Moreover, a simple classification test has been done on our implementation. The results show the basic soundness of our implementation: an accuracy of 79.02% is reached, and the largest confusion lies between tone 1 and tone 4. Still, the work described here is based only on observation and basic testing, so we cannot yet say that it will work very well. The final evaluation requires further consideration and use in a real tone-classification system, against whose classification performance the methods will be evaluated; this will be our further work.

6. REFERENCES

[1] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg, and C. A. McGonegal, "A comparative performance study of several pitch detection algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-24, pp. 399-417, 1976.
[2] M. M. Sondhi, "New methods of pitch extraction," IEEE Transactions on Audio and Electroacoustics, vol. AU-16, pp. 262-266, June 1968.
[3] Yi Kechu, Tian Fu, Fu Qiang, Yu Yin Xin Hao Chu Li (Speech Signal Processing), China Machine Press, Beijing, 2000.
[4] M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley, "Average magnitude difference function pitch extractor," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-22, pp. 353-362, Oct. 1974.
[5] L. R. Rabiner, "On the use of autocorrelation analysis for pitch detection," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-25, no. 1, 1977.
[6] X. Huang, A. Acero, and H. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, 2001.
[7] S.-H. Chen and Y.-R. Wang, "Vector quantization of pitch information in Mandarin speech," IEEE Transactions on Communications, vol. 38, no. 9, pp. 1317-1320, 1990.
[8] C. Wang, "Prosodic Modeling for Improved Speech Recognition and Understanding," Ph.D. dissertation, MIT, June 2001.
[9] S. Potisuk, M. P. Harper, and J. Gandour, "Classification of Thai tone sequences in syllable-segmented speech using the analysis-by-synthesis method," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 1, pp. 95-102, 1999.
