Measuring and Monitoring Speech Quality For Voice Over IP With PO
ARROW@TU Dublin
2015
Eoin Gillen
University of Dublin, Trinity College
Naomi Harte
University of Dublin, Trinity College
Part of the Digital Communications and Networking Commons, and the Signal Processing Commons
Recommended Citation
Hines, A., Gillen, E. & Harte, N. (2015). Measuring and Monitoring Speech Quality for Voice over IP with
POLQA, ViSQOL and P.563. Interspeech Conference, Dresden, Germany, September 6-10. doi:10.21427/t1sg-k177
[Figure 1, panels (a)-(e): bar charts of MOS score (y-axis) against condition number (x-axis), one panel per degradation type.]
Figure 1: The subjective MOS results from the five degradation types. The MOS value for a condition is the average score given by all
listeners to the four speech samples affected by that condition. The error bars are 95% confidence intervals obtained using the method
in ITU-T Rec. P.1401 [9]. The first condition in each panel represents the clean reference condition. The MNRUs are highlighted in
white. The bar charts highlight that conditions covering the ACR quality range were tested across each degradation type.
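As a rough illustration of how a per-condition MOS and its error bar could be computed, the sketch below averages all ratings for one condition and attaches an approximate 95% confidence interval. It is a minimal sketch under stated assumptions, not the authors' procedure: the ratings are invented, the function name condition_mos is hypothetical, and the interval uses the normal quantile 1.96 in place of the Student t quantile, which is a reasonable approximation only because each condition collects many votes (24 listeners x 4 samples = 96).

```python
import math
import statistics


def condition_mos(ratings, z=1.96):
    """Return (MOS, CI half-width) for the ACR ratings of one condition.

    z=1.96 is a normal approximation to the two-sided 95% Student t
    quantile; this is an assumption made to keep the sketch stdlib-only.
    """
    n = len(ratings)
    mos = statistics.fmean(ratings)      # mean opinion score
    sd = statistics.stdev(ratings)       # sample standard deviation
    half_width = z * sd / math.sqrt(n)   # half-width of the interval
    return mos, half_width


# Hypothetical ratings on the 1-5 ACR scale: 24 listeners x 4 samples.
ratings = [4] * 48 + [3] * 24 + [5] * 24
mos, ci = condition_mos(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of votes per condition, the normal quantile would understate the interval and a t quantile should be used instead.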
The database has been subjectively labelled with listener tests complying with the ACR methodology
presented in ITU-T Rec. P.800 [10]. The TSP speech database from McGill University in Canada [11]
provided the reference speech material. It was recorded in an anechoic chamber and consists of
speakers reading sentences from the Harvard test sentence list. The speech samples in the TSP speech
database are 16-bit WAV files sampled at 48 kHz.

Full information on the degradation types in the TCD-VoIP dataset can be found in [5], with a summary
presented here in Table 1. The database was designed so that each type of degradation spans the full
MOS range, i.e. from Bad to Excellent. The per-condition results, grouped by degradation class, are
presented in Figure 1. For chop, the degradation varied according to whether zeros were inserted to
replace samples, or whether samples were deleted or overwritten with earlier samples. The chop period
refers to the length of the chopped segment; the rate refers to how often the samples were chopped. To
create clipped samples, a multiplier was applied to the original signal and values over the maximum
value were simply clipped to that level. Competing speakers are treated as a separate issue from large
crowd babble noise, as the speaker is intelligible and this is a common VoIP call scenario. The gender
and SNR level of the competing speaker were varied. For echo, the alpha value is the amplitude of the
first delayed version of the signal relative to the original. The delay parameter is the delay, in ms,
of the first delayed version of the signal relative to the original. Four types of noise were used:
speech babble noise, car noise, road noise and office noise. The SNR was also varied for these noise
degradations.

Table 1: Summary of Degradations and Parameters used in TCD-VoIP

Degradation        Conditions  Parameters  Range
Chop               20          Rate        0-6 chops/s
                               Period      0.02-0.04 s
                               Mode        Insert, Delete, Overwrite
Clip               10          Multiplier  1-55
Competing Speaker  10          Gender      code 1-5
                               SNR         10-50 dB
Echo               20          Alpha       0-0.5
                               Delay       0-220 ms
Noise              20          Noise Type  Car, Street, Office, Babble
                               SNR         5-55 dB
MNRUs              4           SNR (Q)     48, 36, 24, 12

Aside from the VoIP conditions, Modulated Noise Reference Unit (MNRU) conditions were also included in
the tests (see ITU-T Rec. P.810 [12] and [5] for further details). 24 listeners were used in all
experiments and each condition was tested with 4 speakers (2 male and 2 female).

3. Objective Speech Quality Metrics
As mentioned in the introduction, different application scenarios necessitate different speech quality
models. The ITU G.107 E-Model [13, 14] is an example of a parametric model and is not investigated
here, as this paper is focused on evaluating signal-based models. Full-reference models can produce
accurate measurements of speech quality by comparing a reference and a test signal. Monitoring models
are no-reference and estimate the speech quality from the test signal without access to a clean
reference to compare against. With access to more information, full-reference metrics can generally
produce more accurate predictions than no-reference metrics, but they are not suitable for deployment
as realtime monitoring metrics in VoIP applications.

The performances of three objective metrics, POLQA, ViSQOL and P.563, have been compared in this
study. Prior to the development of POLQA, ITU-T Rec. P.862 [15] presented PESQ, a full-reference
metric designed to estimate the quality of narrowband (300–3,400 Hz) speech and networks. It first
aligns the degraded and reference signals, and then compares both signals using a perceptual model. A
subsequent revision to P.862 (P.862.2) [16] enabled PESQ to rate wideband (50–7,000 Hz) signals. PESQ
is widely used; however, it is inaccurate in some scenarios: suboptimal listening levels, loudness
loss, delays in conversational tests, talker echo or sidetone. These limitations are acknowledged in
the recommendation.

POLQA, introduced in ITU-T Rec. P.863 [4], is seen as a successor to PESQ, and was designed to conform to newer
industry requirements and address acknowledged shortcomings of PESQ. The extended PESQ revision
(P.862.2) added wideband support to the model (50–7,000 Hz), while POLQA can rate signals with
bandwidths of 50–14,000 Hz (superwideband signals). The basic philosophy used in POLQA is the same as
that used in PESQ, i.e. the algorithm first aligns the degraded and reference signals, and then
compares both signals using a perceptual model. The POLQA algorithm also contains some additional
steps and considerations designed to improve the prediction accuracy of the metric. Despite this, some
of the limitations of PESQ (specifically in the cases of delays in conversational tests, talker echo
or sidetone) are still present in POLQA [17]. In this paper, we only assess the performance of POLQA
for the VoIP scenarios, as its performance should be superior to PESQ.

ViSQOL [18] evolved from NSIM (Neurogram Similarity Index Measure), a tool developed by Hines and
Harte [19] to predict speech intelligibility for hearing-impaired listeners. An overview of NSIM and
ViSQOL is given in Hines et al. [20], which also compares the performance of ViSQOL to that of PESQ
for two common VoIP issues (clock drift and jitter). NSIM is a full-reference metric which compares
neurograms created from the reference and degraded signals. A simplified algorithm is used in ViSQOL,
which compares spectrograms created from Short-term Fourier Transforms (STFT) of both signals. NSIM
outputs a similarity score between 0 and 1, which is mapped to a MOS-LQO scale. Metrics such as POLQA
and ViSQOL are useful when designing new VoIP systems or measuring and evaluating the performance of
systems for particular scenarios. ViSQOL and POLQA have also recently been adapted for audio quality
evaluation [21]. This work builds on prior work where POLQA, PESQ and ViSQOL were evaluated under a
variety of narrowband speech scenarios with different datasets [22, 23, 24].

P.563, introduced in ITU-T Rec. P.563 [25], is a no-reference metric designed to estimate the quality
of narrowband (100–3,100 Hz) speech. Sometimes referred to as single-ended, no-reference metrics like
P.563 predict speech quality without access to a clean reference signal. This class of metric is
particularly useful in realtime monitoring scenarios where no reference signal is available to compare
against. P.563 was designed to account for the full range of degradations present in PSTNs. To rate a
speech signal without a reference, P.563 makes use of a large number of characteristic speech
parameters, which can be split into 6 categories: basic speech descriptors, vocal tract analysis,
speech statistics, static SNR, segmental SNR and interruptions/mutes. The output score is based on
these parameters. Output scores have been mapped to MOS values using a set of speech clips and
subjective test results. Some of its limitations include talker echo, sidetones and singing voice, as
well as limited testing with amplitude clipping during standardisation [3].

The metrics chosen for this test are the current recommended full-reference (POLQA) and no-reference
(P.563) metrics and the full-reference metric ViSQOL, which was developed to specifically target VoIP
scenarios. It should be noted that other speech quality metrics have been developed and are in common
use but were not tested here (e.g. [26, 27, 16]). The three chosen metrics provide a baseline
benchmark for objective full-reference and no-reference metrics on this dataset.

4. Metric Evaluation
The subjective listener test mean opinion scores (MOS-LQS) for the database were compared with
predictions from the three objective metrics. POLQA was tested using its super-wideband mode and
ViSQOL was tested using its wideband speech mode. As P.563 is a narrowband metric, the degraded
signals were resampled at 8 kHz for testing with the no-reference metric. The test evaluated 384
sample speech files. For each condition, 4 samples (2 male and 2 female speakers) were tested, giving
96 conditions. The objective mean opinion score (MOS-LQO) for the given condition was computed. The
per-condition MOS-LQS were used to benchmark the metrics.

5. Results and Discussion
The results for the objective metrics POLQA, ViSQOL and P.563 on the dataset (denoted by MOS-LQO) are
compared to the subjective results (denoted by MOS-LQS) in Figure 2, broken down by degradation
condition. A statistical analysis of the objective metric prediction accuracy compared with the
subjective listener test results is presented in Table 2. Pearson correlation coefficients
(ρ_pearson), Spearman rank order coefficients (ρ_spearman) and root mean squared error (RMSE) are
reported for each metric. The results are further broken down by condition type. As ViSQOL performed
poorly with the CHOP data, two aggregated condition totals are displayed, including and excluding the
CHOP conditions.

[Figure 2: three scatter-plot panels of MOS-LQO (y-axis) against MOS-LQS (x-axis), with points marked by degradation class (CHOP, CLIP, COMPSPKR, ECHO, NOISE).]

Table 2: Pearson correlation coefficients, Spearman rank correlations and RMSE per condition. The results are broken down by
degradation class with a grouped result for all conditions and a final grouping for all conditions excluding the chop condition.

POLQA performs well in predicting quality across all the degradation types tested. With careful
review of Figure 2, it can be seen that POLQA has a general trend of over-estimating quality for
noise, echo and competing speakers. It tends to underestimate for clip, with more consistent
performance for chop. Examining the statistics in Table 2 confirms POLQA's ability to predict speech
quality accurately across all condition classes in the TCD-VoIP database, with high scores both in
terms of Pearson correlation coefficients and Spearman rank correlations.

The quality predictions from ViSQOL are well-correlated with the subjective results in the clip,
competing speaker, echo and noise tests. There is a general trend visible that ViSQOL underestimates
quality for echo conditions. ViSQOL's scores on the choppy speech are not well-matched to the
subjective scores. An analysis of the individual conditions found that the insert and delete
conditions account for the over-estimated cluster (above the diagonal), while the underestimated
cluster of 7 chop conditions seen significantly below the diagonal was composed exclusively of the
overwrite chop sub-condition. When portions of the signal are overwritten, ViSQOL's comparison
algorithm can be tricked into aligning speech segments with the overwritten repetition segment rather
than the original segment. This causes mis-aligned comparisons with the reference spectrogram and
results in a low speech quality estimate. Conditions using the other two chop modes do not cause this
problem; in fact, ViSQOL marginally overestimated quality for these conditions. Overall, the
correlation statistics reveal that the performance of ViSQOL and POLQA is close, particularly if the
chop conditions are not taken into account. This is useful for researchers as ViSQOL is a freely
available speech quality metric.

P.563 was the only no-reference metric tested in this work. As a no-reference metric, its scores were
not expected to be as well-matched as those of POLQA or ViSQOL. However, as can be seen in Figure 2,
P.563's predictions bear almost no relation to the subjective results. It appears that the lowest
(MOS 2) subjective results also obtain the lowest P.563 results, but no further relationship can be
discerned. Almost all of P.563's results lie between MOS values of 2.5 and 3.5. Looking at Table 2,
the only test in which a positive trend can be seen is the clipped speech test. This was somewhat of
a surprise, as it was noted earlier that amplitude clipping was a condition with limited testing
during the metric's development.

These results show that POLQA is capable of predicting the subjective MOS value of any condition in
the TCD-VoIP dataset, although its predictions for some clipped speech may be slightly low. ViSQOL is
capable of predicting the subjective MOS values for speech with clipping, noise, competing speakers or
echo, but struggles with choppy speech, specifically in cases where portions of the signal have been
overwritten.

The results for P.563 show that it is incapable of predicting subjective MOS values for conditions in
the dataset. P.563's use cases (listed by Möller et al. [3]) are mostly for detecting signal warping
or network effects. Also, two of its limitations are echo and clipping. From this, it can be concluded
that P.563 is unsuitable for the task of rating clips in TCD-VoIP. A gap exists for a wideband speech
quality metric capable of monitoring VoIP applications, as none of the no-reference objective quality
metrics currently available were specifically developed with this task in mind.

6. Conclusion
This paper reports benchmarking results of three speech quality metrics on a VoIP speech database. Two
full-reference signal-based metrics were evaluated to establish their accuracy as measurement tools
for speech quality. The results showed that for the classes of VoIP degradation tested, full-reference
speech quality metrics can provide accurate predictions of speech quality and could be used in
developing and testing. The results for the ITU recommended POLQA metric were consistent across all
degradation classes, further validating its capabilities in a wide range of speech scenarios. The
tests also highlighted alignment problems for ViSQOL when choppy data uses an overwrite strategy
repeating a previous segment of the speech. This will be investigated further in future development of
the ViSQOL metric. For monitoring applications, the results showed that the no-reference metric
tested, P.563, could not predict quality to a usable level of accuracy. This highlights an important
unaddressed requirement for VoIP applications, namely the need for a no-reference wideband speech
quality metric capable of monitoring VoIP applications. The authors are currently using the findings
presented to address the challenge of monitoring VoIP quality with a realtime, wideband no-reference
metric.

7. Acknowledgements
Thanks to Google, Inc. and Enterprise Ireland for funding.
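The three statistics used in Section 5 to compare per-condition MOS-LQO against MOS-LQS (Pearson correlation, Spearman rank correlation and RMSE) can be sketched as follows. This is an illustrative sketch only: the score lists are invented, the function names are hypothetical, and the RMSE here divides by the number of conditions, which may differ from the exact variant used to produce Table 2.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(x, y):
    """Root mean squared error, dividing by the number of pairs."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical per-condition scores, not taken from TCD-VoIP.
mos_lqs = [4.5, 3.8, 3.0, 2.1, 1.4]   # subjective
mos_lqo = [4.3, 4.0, 2.8, 2.5, 1.6]   # objective prediction
r = pearson(mos_lqs, mos_lqo)
rho = spearman(mos_lqs, mos_lqo)
err = rmse(mos_lqs, mos_lqo)
print(f"pearson={r:.3f} spearman={rho:.3f} rmse={err:.3f}")
```

Reporting the rank correlation alongside Pearson is useful because a metric can order conditions correctly while being offset or compressed on the MOS scale, in which case the rank correlation stays high while the RMSE grows.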
8. References
[1] Alcatel-Lucent, “PSTN industry analysis and service provider strategies: Synopsis,” http://goo.gl/tTPFes, Alcatel-Lucent, Paris, France, Tech. Rep. Bell Labs Analysis for BT, 2013.
[2] L. K. Vanston and R. L. Hodges, “Forecasts for the US telecommunications network,” Telektronnik, vol. 104, no. 3/4, pp. 18–28, 2008.
[3] S. Möller, W.-Y. Chan, N. Côté, T. H. Falk, A. Raake, and M. Waltermann, “Speech quality estimation: Models and trends,” Signal Processing Magazine, IEEE, vol. 28, no. 6, pp. 18–28, 2011.
[4] ITU, “Perceptual objective listening quality assessment,” Int. Telecomm. Union, Geneva, Switzerland, ITU-T Rec. P.863, 2011.
[5] N. Harte, E. Gillen, and A. Hines, “TCD-VoIP, a research database of degraded speech for assessing quality in VoIP applications,” in Quality of Multimedia Experience (QoMEX), Costa Navarino, Greece, 2015.
[6] M. Soloducha and A. Raake, “Speech quality of VoIP: bursty packet loss revisited,” in Speech Communication; 11. ITG Symposium; Proceedings of, Sept 2014, pp. 1–4.
[7] J. Holub and O. Slavata, “Impact of IP channel parameters on the final quality of the transferred voice,” in Wireless Telecommunications Symposium (WTS), 2012, April 2012, pp. 1–5.
[8] J. Zhu, R. Vannithamby, C. Rodbro, M. Chen, and S. Vang Andersen, “Improving QoE for Skype video call in mobile broadband network,” in Global Communications Conference (GLOBECOM), 2012 IEEE, Dec 2012, pp. 1938–1943.
[9] ITU, “Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models,” Int. Telecomm. Union, Geneva, Switzerland, Tech. Rep. ITU-T Rec. P.1401, 2012.
[10] ITU, “Methods for subjective determination of transmission quality,” Int. Telecomm. Union, Geneva, Switzerland, Tech. Rep. ITU-T Rec. P.800, 1996.
[11] P. Kabal, “TSP speech database,” McGill University, Quebec, Canada, Tech. Rep. Database Version 1.0, 2002.
[12] ITU, “Modulated noise reference unit (MNRU),” Int. Telecomm. Union, Geneva, Switzerland, Tech. Rep. ITU-T Rec. P.810, 1996.
[13] ITU, “The E-model, a computational model for use in transmission planning,” Int. Telecomm. Union, Geneva, Switzerland, ITU-T Rec. G.107, 2009.
[14] ITU, “Wideband E-model,” Int. Telecomm. Union, Geneva, Switzerland, ITU-T Rec. G.107.1, 2011.
[15] ITU, “Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Int. Telecomm. Union, Geneva, Switzerland, ITU-T Rec. P.862, 2001.
[16] ITU, “Wideband extension to recommendation P.862 for the assessment of wideband telephone networks and speech codecs,” Int. Telecomm. Union, Geneva, Switzerland, ITU-T Rec. P.862.2, 2005.
[17] ITU, “Perceptual objective listening quality assessment,” Int. Telecomm. Union, Geneva, Switzerland, Tech. Rep. ITU-T Rec. P.863, 2011.
[18] A. Hines, J. Skoglund, A. C. Kokaram, and N. Harte, “ViSQOL: an objective speech quality model,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015:13, May 2015.
[19] A. Hines and N. Harte, “Speech intelligibility prediction using a neurogram similarity index measure,” Speech Communication, vol. 54, no. 2, pp. 306–320, 2012.
[20] A. Hines, J. Skoglund, A. Kokaram, and N. Harte, “ViSQOL: The virtual speech quality objective listener,” in Acoustic Signal Enhancement; Proceedings of IWAENC 2012; International Workshop on, Sept 2012, pp. 1–4.
[21] A. Hines, E. Gillen, J. Skoglund, A. Kokaram, and N. Harte, “ViSQOLAudio: An objective audio quality metric for low bitrate codecs,” The Journal of the Acoustical Society of America, vol. 137:6, June 2015.
[22] A. Hines, J. Skoglund, A. Kokaram, and N. Harte, “Robustness of speech quality metrics to background noise and network degradations: Comparing ViSQOL, PESQ and POLQA,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 2013.
[23] A. Hines, P. Pocta, and H. Melvin, “Detailed analysis of PESQ and VISQOL behaviour in the context of playout delay adjustments introduced by VoIP jitter buffer algorithms,” in Quality of Multimedia Experience (QoMEX), Klagenfurt am Wörthersee, Austria, 2013.
[24] P. Pocta, H. Melvin, and A. Hines, “An analysis of the impact of playout delay adjustments introduced by VoIP jitter buffers on speech quality,” Acta Acoustica united with Acustica, vol. 101, no. 2, May–June 2015.
[25] ITU, “Single-ended method for objective speech quality assessment in narrow-band telephony applications,” Int. Telecomm. Union, Geneva, Switzerland, ITU-T Rec. P.563, 2011.
[26] ANSI ATIS, “0100005-2006: Auditory non-intrusive quality estimation plus (ANIQUE+): Perceptual model for non-intrusive estimation of narrowband speech quality,” 2006.
[27] V. Grancharov, D. Zhao, J. Lindblom, and W. Kleijn, “Low-complexity, nonintrusive speech quality assessment,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 14, no. 6, pp. 1948–1956, Nov 2006.