Instantaneous Pitch Estimation Algorithm Based On Multirate Sampling
2. MULTIRATE SAMPLING
S_k(n) = \sum_{i=-\infty}^{\infty} h_k(i)\, s(n-i), \quad (3)

where h_k(i) is the impulse response of the k-th analysis filter.
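As an illustrative sketch, the subband filtering of (3) can be written in a few lines. The Hamming-windowed complex exponential below is an assumed form of the analysis filter h_k (eq. (2), which defines it, is not reproduced in this excerpt), and the filter length is an arbitrary choice:

```python
import numpy as np

def subband(s, omega_k, filt_len=64):
    # Complex subband S_k(n) of eq. (3): convolve the signal with a
    # bandpass analysis filter h_k centred at omega_k (rad/sample).
    # The windowed complex exponential is an assumed filter form.
    i = np.arange(filt_len)
    h = np.hamming(filt_len) * np.exp(1j * omega_k * (i - filt_len // 2))
    return np.convolve(s, h, mode="same")

# A pure tone at 0.5 rad/sample passes the band centred at 0.5
# and is strongly attenuated by a band centred elsewhere.
s = np.cos(0.5 * np.arange(1000))
in_band = np.mean(np.abs(subband(s, 0.5)))
off_band = np.mean(np.abs(subband(s, 1.5)))
```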
Instantaneous parameters of subband components are
available from the following expressions:
A_k(n) = \sqrt{R^2(n) + I^2(n)}, \quad (4)

\varphi_k(n) = \arctan\frac{-I(n)}{R(n)}, \qquad \omega_k(n) = \varphi_k'(n), \quad (5)

where R(n) and I(n) are the real and imaginary parts of S_k(n)
respectively. To avoid phase discontinuities in (5) the phase is
unwrapped.

Figure 1 – Adjusting the analysis filter bank to each period
candidate: (a) – period candidate generation function,
(b) – amplitude spectrum and filter bank for candidate \omega_0^1,
(c) – filter bank for candidate \omega_0^2, (d) – source signal
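As a minimal numerical sketch, the instantaneous parameters of (4)–(5) can be obtained from a complex subband as follows (note that NumPy's angle convention, atan2(I, R), differs in sign from the arctan(−I/R) written above):

```python
import numpy as np

def inst_params(S_k):
    # Eq. (4): instantaneous amplitude from real and imaginary parts.
    A = np.abs(S_k)
    # Eq. (5): unwrapped instantaneous phase and its derivative,
    # the instantaneous frequency in rad/sample.
    phi = np.unwrap(np.angle(S_k))
    omega = np.gradient(phi)
    return A, phi, omega

# For a unit-amplitude complex exponential at 0.3 rad/sample the
# estimates recover the amplitude and frequency.
n = np.arange(200)
A, phi, omega = inst_params(np.exp(1j * 0.3 * n))
```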
Instantaneous parameters are used as initial data for PCGF
evaluation. Assuming that possible pitch variation is proportional
to the current pitch value, the parameters of the filter bank should
be scaled for each period candidate:

\omega_{step}(\omega_0) = \omega_0, \qquad \omega_{bw}(\omega_0) = \alpha\omega_0, \quad (6)

where \omega_0 is the frequency of the candidate in radians per
sample and \alpha is the allowed relative pitch variation. The time
duration of the analysis frame should be adjusted accordingly to
contain a fixed number of periods:

N = 2\pi L/\omega_0, \quad (7)

where N is the number of samples and L the number of periods in
the frame. The idea is illustrated in figure 1.

Considering that only a few harmonics are of practical importance,
it is possible to use very short analysis frames. Assuming (8), the
parameters of the filter bank become

\omega_{step} = 2\pi/R, \qquad \omega_{bw} = \alpha\omega_{step}. \quad (11)

The desired parameters are extracted from the signal using the
multirate sampling scheme shown in figure 2.
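The per-candidate scaling of (6)–(7) amounts to a few lines; the value of α used below is an assumed one:

```python
import numpy as np

def candidate_settings(omega0, alpha=0.25, L=4):
    # Eq. (6): the filter spacing equals the candidate frequency and
    # the bandwidth is the fraction alpha of it (alpha is assumed here).
    omega_step = omega0
    omega_bw = alpha * omega0
    # Eq. (7): frame length covering exactly L periods of the candidate.
    N = int(round(2 * np.pi * L / omega0))
    return omega_step, omega_bw, N

# With omega0 = 2*pi/17 and L = 4 the frame holds 68 samples,
# matching N = R*L of eq. (9) for R = 17.
_, _, N = candidate_settings(2 * np.pi / 17, L=4)
```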
Changing the parameters of the filter bank for each period
candidate is computationally expensive. Alternatively, it is
possible to use a filter bank with fixed parameters and change the
sampling frequency of the signal. Let us specify the sampling
frequency F_s as a multiple of the frequency of the period
candidate:

F_s = R f_0, \quad (8)

where f_0 is the frequency of the candidate in Hz and R is an
integer factor. Using (8) and considering that
f_0 = \omega_0 F_s/(2\pi), eq. (7) results in a fixed analysis frame
length for all period candidates:

N = RL. \quad (9)

Factor R determines how many harmonics are retained in the
resampled signal:

K = \begin{cases} (R-1)/2, & \text{for odd } R, \\ R/2 - 1, & \text{for even } R. \end{cases} \quad (10)

Figure 2 – Multirate sampling scheme for pitch period candidate
generation (M – number of period candidates)

3. PERIOD CANDIDATE GENERATION FUNCTION

For period candidate generation an autocorrelation-based measure is
used. In [12] the normalized cross-correlation function (NCCF) was
introduced:

\phi(l) = \frac{\sum_{i=0}^{N-1} s(i)\, s(i+l)}{\sqrt{e(0)\, e(l)}}, \quad (12)

where l is the lag in samples and e(l) = \sum_{i=l}^{N+l-1} s^2(i).
The function averages data within the analysis frame and gives smoothed
values. In order to improve time resolution the instantaneous
model-based version of the function can be used [7]:

\phi_{inst}(n, l) = \frac{\sum_{k=1}^{K} A_k^2(n) \cos(\omega_k(n)\, l)}{\sum_{k=1}^{K} A_k^2(n)}. \quad (13)

The function assumes that the bandwidth of each analysis filter is
narrower than the minimum possible pitch value, so the harmonics
are always separated. The multirate analysis scheme described above
suffers from harmonic mixing (which occurs for high-frequency
candidates when processing low-pitched sounds) and causes sporadic
amplification of high-frequency regions of \phi_{inst}(). In order
to reduce the impact of harmonic mixing, the following period
candidate generating function is used, which multiplies the
obtained measures of 2V + 1 adjacent samples:

\phi_{ms}(n, l) = \prod_{v=-V}^{V} \sum_{k=1}^{K} A_k(n+v) \cos(\omega_k(n+v)\, l). \quad (14)

For each lag l the model parameters A_k and \omega_k are estimated
on a separate channel of the multirate scheme with sampling factor
R/l according to (8). The energy of each resampled frame is
normalized to 1 in order to equalize the values of \phi_{ms}() for
different candidates. Using non-squared amplitudes in (14) instead
of the squared amplitudes in (13) is generally more robust, since
the contributions of the harmonic amplitudes become more balanced.
Normally the effect of harmonic mixing emerges over short time
periods and can be significantly reduced by multiplying just a few
terms. Figure 3 shows the period candidates generated by \phi(),
\phi_{inst}() and the proposed function \phi_{ms}() for a short
speech fragment. Function \phi_{ms}() clearly provides much higher
frequency and time resolution than \phi() and \phi_{inst}().

4. PITCH ESTIMATION ALGORITHM

The proposed algorithm consists of the following steps:
1) resample the input signal frame s(n) for each period candidate
using the corresponding sampling rate (8);
2) normalize the energy of each resampled frame to 1;
3) estimate the instantaneous harmonic parameters using equations
(2)–(5); this step is implemented using 2V + 1 overlapping discrete
Fourier transforms for each resampled frame; the Hamming window is
used as window function w(n) in (2);
4) evaluate (14) for each period candidate from the corresponding
set of parameters;
5) multiply the obtained period candidate values by a weighting
window that penalizes low-frequency candidates:
w_{wght}(\omega_0) = 0.2\,\omega_0/\pi + 0.8;
6) find the best locally continuous contour maximizing the total
period candidate values for adjacent frames with dynamic
programming; this step results in the selection of the best current
candidate \omega_{0,best}(n) – a rough pitch estimate;
7) calculate the fine pitch value \omega_{0,fine}(n) from the
instantaneous harmonic parameters extracted for the best candidate
using a weighted sum:

\omega_{0,fine}(n) = \frac{1}{\sum_{k=1}^{K} A_k(n)} \sum_{k=1}^{K} \frac{1}{k}\, \omega_k(n)\, A_k(n). \quad (15)

Considering that the filter bank is implemented using the fast
Fourier transform, the overall computational complexity of the
algorithm (multiplications per pitch estimate) can be approximately
expressed as O(IKN + 2(V + 1)KN \log N), where I is the length of
the low-pass interpolation filter used for resampling.

For the practical implementation of the algorithm we used the
following values: K = 8, L = 4, R = 2K + 1 = 17, N = 68, M = 100,
I = 121, V = 1. The allowed pitch range is 50 to 450 Hz; it is
divided uniformly on a logarithmic scale into 100 points, each
corresponding to one candidate. The duration of the resampled
frames varies from 80 ms (the longest period candidate) to 9 ms
(the shortest period candidate).
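Steps 1–2 and the fine estimate of step 7 can be sketched as follows. Linear interpolation stands in for the length-I low-pass interpolation filter used in the paper and is an assumption of this sketch:

```python
import numpy as np

def resample_frame(s, Fs, f0, R=17, L=4):
    # Steps 1-2: resample the frame to rate R*f0 (eq. (8)) so it holds
    # exactly N = R*L samples (eq. (9)), then normalize its energy to 1.
    # Linear interpolation replaces the paper's low-pass interpolation
    # filter for brevity.
    N = R * L
    t = np.arange(N) / (R * f0)
    y = np.interp(t, np.arange(len(s)) / Fs, s)
    return y / np.sqrt(np.sum(y ** 2))

def fine_pitch(A, omega):
    # Step 7, eq. (15): amplitude-weighted average of the harmonic
    # frequencies divided by their harmonic numbers.
    k = np.arange(1, len(A) + 1)
    return np.sum(omega / k * A) / np.sum(A)

frame = resample_frame(np.sin(2 * np.pi * 100 * np.arange(400) / 8000),
                       Fs=8000, f0=100)
A = np.array([1.0, 0.5, 0.25])
omega = 0.3 * np.arange(1, 4)   # perfectly harmonic at 0.3 rad/sample
```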
5. SIMULATION RESULTS
artificial signals with changing pitch in the range from 100 to
350 Hz. All obtained measurements were separated into six groups
distinguished by variation rate: 0–0.3, 0.3–0.6, 0.6–0.9, 0.9–1.2,
1.2–1.5 and >1.5 percent of pitch change per millisecond. Averaged
errors are shown in figure 4.

            Male speech         Female speech
            GPE%     MFPE%      GPE%     MFPE%
RAPT        3.687    1.737      6.068    1.184
YIN         3.184    1.389      3.960    0.835
SWIPE′      0.756    1.505      4.273    0.800
PEFAC       20.521   1.383      31.192   0.972
IRAPT 1     1.625    1.608      3.777    0.977
Halcyon     0.743    1.268      3.600    1.039

Table 1 – Pitch estimation error (natural speech)
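The excerpt does not state the exact GPE/MFPE definitions; a sketch under the common convention (a gross error is a deviation of more than 20 % from the reference, and the fine error is averaged over the remaining frames) is:

```python
import numpy as np

def gpe_mfpe(f_est, f_ref, thr=0.20):
    # Relative deviation of each pitch estimate from the reference.
    rel = np.abs(np.asarray(f_est) - np.asarray(f_ref)) / np.asarray(f_ref)
    # Gross pitch error: percentage of frames deviating by more than
    # the threshold (the 20% value is an assumed convention here).
    gross = rel > thr
    gpe = 100.0 * np.mean(gross)
    # Mean fine pitch error over the frames counted as correct.
    mfpe = 100.0 * np.mean(rel[~gross])
    return gpe, mfpe

gpe, mfpe = gpe_mfpe([101.0, 99.0, 100.0, 150.0], [100.0] * 4)
```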
Noisy test samples were generated using two types of noise (white
and babble) with SNRs from −20 to 20 dB. The averaged results for
the noisy samples are shown in figures 6 and 7.
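A standard way to produce such noisy samples at a prescribed SNR (the exact mixing procedure is not detailed in the excerpt) is to scale the noise power relative to the clean speech power:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Scale the noise so that the clean-to-noise power ratio equals
    # the requested SNR in dB, then add it to the clean signal.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 100 * np.arange(8000) / 8000)
noisy = mix_at_snr(clean, rng.standard_normal(8000), snr_db=0.0)
added = noisy - clean
snr_measured = 10 * np.log10(np.mean(clean ** 2) / np.mean(added ** 2))
```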
6. CONCLUSIONS
8. REFERENCES

[1] F. Zhang, G. Bi, Y. Q. Chen, "Harmonic transform," in Vision,
Image and Signal Processing, IEE Proceedings, vol. 151, no. 4,
pp. 257–263, 2004.
[2] R. J. McAulay, T. F. Quatieri, "Speech analysis/synthesis based
on a sinusoidal representation," IEEE Transactions on Acoustics,
Speech and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.
[3] J. Laroche, Y. Stylianou, E. Moulines, "HNS: Speech
modification based on a harmonic+noise model," in ICASSP-93 – IEEE
International Conference on Acoustics, Speech, and Signal
Processing, April 27–30, Minneapolis, USA, Proceedings, 1993,
pp. 550–553.
[4] J. O. Hong, P. J. Wolfe, "Model-based estimation of
instantaneous pitch in noisy speech," in INTERSPEECH 2009 – 10th
Annual Conference of the International Speech Communication
Association, September 6–10, Brighton, UK, Proceedings, 2009,
pp. 112–115.
[5] B. Resch, M. Nilsson, A. Ekman, W. B. Kleijn, "Estimation of
the Instantaneous Pitch of Speech," IEEE Transactions on Audio,
Speech and Language Processing, vol. 15, no. 3, pp. 819–822, 2007.
[6] T. Abe, T. Kobayashi, S. Imai, "Harmonics tracking and pitch
extraction based on instantaneous frequency," in ICASSP-95 – IEEE
International Conference on Acoustics, Speech, and Signal
Processing, May 9–12, Detroit, USA, Proceedings, 1995, pp. 756–759.
[7] E. Azarov, M. Vashkevich, A. Petrovsky, "Instantaneous pitch
estimation based on RAPT framework," in EUSIPCO'12 – European
Signal Processing Conference, August 27–31, Bucharest, Romania,
Proceedings, 2012, pp. 2787–2791.
[8] E. Azarov, M. Vashkevich, A. Petrovsky, "Instantaneous harmonic
representation of speech using multicomponent sinusoidal
excitation," in INTERSPEECH 2013 – 14th Annual Conference of the
International Speech Communication Association, August 25–29, Lyon,
France, Proceedings, 2013, pp. 1697–1701.
[9] E. Azarov, M. Vashkevich, A. Petrovsky, "Guslar: A framework
for automated singing voice correction," in ICASSP-2014 – IEEE
International Conference on Acoustics, Speech, and Signal
Processing, May 4–9, Florence, Italy, Proceedings, 2014,
pp. 7919–7923.
[10] K. Hotta, K. Funaki, "On a Robust F0 Estimation of Speech
based on IRAPT using Robust TV-CAR Analysis," in APSIPA 2014 –
Annual Summit and Conference of the Asia-Pacific Signal and
Information Processing Association, December 9–12, Siem Reap,
Cambodia, Proceedings, 2014, pp. 1–4.
[11] E. van den Berg, B. Ramabhadran, "Dictionary-based pitch
tracking with dynamic programming," in INTERSPEECH 2014 – 15th
Annual Conference of the International Speech Communication
Association, September 14–18, Singapore, Proceedings, 2014,
pp. 1347–1351.
[12] D. Talkin, "A Robust Algorithm for Pitch Tracking (RAPT)," in
Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, eds.,
Elsevier, 1995.
[13] A. de Cheveigné, H. Kawahara, "YIN, a fundamental frequency
estimator for speech and music," Journal of the Acoustical Society
of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[14] A. Camacho, J. G. Harris, "A sawtooth waveform inspired pitch
estimator for speech and music," Journal of the Acoustical Society
of America, vol. 123, no. 4, pp. 1638–1652, 2008.
[15] S. Gonzalez, M. Brookes, "PEFAC – A Pitch Estimation Algorithm
Robust to High Levels of Noise," IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 22, no. 2, pp. 518–530, 2014.
[16] G. Pirker, M. Wohlmayr, S. Petrik, F. Pernkopf, "A Pitch
Tracking Corpus with Evaluation on Multipitch Tracking Scenario,"
in INTERSPEECH 2011 – 12th Annual Conference of the International
Speech Communication Association, August 27–31, Florence, Italy,
Proceedings, 2011, pp. 1509–1512.